CHAPTER 1:
INTRODUCTION 


1.1  The  Chat  Domain 

Since its introduction in the late 1980s, Internet Relay Chat (IRC) has become popular worldwide
as a means of real-time communication. With hundreds of thousands of users each day,
the  volume  of  data  created  is  overwhelming  for  complete  human  analysis.  Natural  Language 
Processing  (NLP)  techniques  can  be  applied  for  applications  such  as  social  networking  analysis, 
data-mining  and  detection  of  illicit  uses. 

1.1.1  General  Chat  Characteristics 

IRC  chat  “rooms”  are  hosted  on  servers  around  the  world.  Some  of  these  rooms  are  devoted 
to  specific  topics,  while  others  are  simply  gathering  places  for  social  interaction.  Users  log  in 
to rooms of their choosing and select an alias for self-identification. They are then free to type
their inputs, which are broadcast to all participants in the room. There are also functions
available  that  allow  users  to  hold  “private”  conversations  with  other  selected  users.  In  these 
private  rooms,  only  those  users  invited  to  participate  see  others’  posts. 

Due  to  IRC’s  synchronous  nature,  users  may  provide  their  inputs  at  any  time.  There  is  no 
requirement  for  turn-taking  as  commonly  found  in  spoken  dialog.  Hence,  IRC  data  streams 
frequently consist of multiple, interleaved conversations, further complicating analysis. For
example,  when  a  user  presents  a  question  it  is  available  to  all  users  present  in  the  respective 
chat  room.  The  next  item  appearing  in  the  stream  may  not  be  a  response  to  this  question  and 
may,  in  fact,  be  another,  unrelated  question  by  a  different  participant.  Correlating  questions 
and  subsequent  answers  becomes  a  difficult  task,  particularly  in  active  chat  rooms  with  many 
participants.  The  problem  of  identifying  who  said  what  to  whom  is  called  conversational  thread 
extraction. A good source for understanding this problem and other chat-specific issues is
Adams  [1]. 

Because  chat  users  are  not  generally  constrained  by  strict  language  semantics  or  structure,  the 
task  of  identifying  questions  amongst  other  types  of  posts  is  also  difficult.  While  traditional 
written  language  contains  punctuation  (question  marks)  that  identify  illocutionary  (or  dialog) 
acts  as  questions,  these  clues  are  frequently  missing  in  chat  messages.  Identifying  questions,  as 
opposed to other dialog acts, is therefore a difficult task. The ad-hoc nature of chat usage also
results  in  unique  features,  including  symbols  intended  to  convey  emotions  (“emoticons”)  and 
intentionally  misspelled  words,  not  typically  found  in  traditional  language  usage.  As  a  result, 
parsing  algorithms  that  are  trained  on  structured  language  examples  perform  poorly  in  the  chat 
domain. 

Previous  work  in  the  chat  domain  has  focused  on  part  of  speech  and  dialog  act  tagging  as 
a  foundation  for  higher-level  analysis.  These  tasks  include  conversational  thread  extraction, 
data-mining  and  social-networking  analysis.  Due  to  the  aforementioned  structural  differences 
between  chat  data  and  oral  or  written  data,  these  tasks  are  not  easily  automated  and  human 
interaction  is  frequently  required.  The  volume  of  data  created  by  heavily  populated  chat  servers, 
however,  makes  such  human  involvement  infeasible.  Development  of  NLP  techniques  to  assist 
in  these  tasks,  in  this  particular  domain,  is  therefore  desirable. 

While  this  thesis  is  focused  on  IRC  data,  the  techniques  apply  to  any  chat  system  such  as  Yahoo, 
AOL  Instant  Messenger  or  even  military  applications  such  as  tactical  chat. 


Chat  in  the  Military  Domain 

Just  as  chat  is  a  popular  form  of  communication  for  the  general  public,  tactical  military  chat 
has become an important command and control tool for forces operating around the world [2].
The  topics  discussed  in  these  chat  sessions  are  more  focused  toward  tactical  situations  and  are 
structured  with  user  names  derived  from  assigned  user  duties.  This  additional  structure  may 
provide  information  useful  for  higher  level  analysis  such  as  post-event  reconstruction.  The 
information  derived  from  this  data  may  then  be  used  to  document  lessons  learned  for  follow-on 
tactical  performance  improvements. 

Eovito provided functional requirements for tactical military chat. Eovito’s work included items
that we believe would benefit from inclusion of dialog act information, such as “Thread
Population/Repopulation,” “Suppress System Event Messages,” and “User Access to Chat Logs” [2].
Consider the possibility of a chat participant being able to determine who has asked what
questions and what answers were provided without interrupting other users. These functions may
serve  to  filter  undesired  noise  from  the  conversation  thereby  increasing  the  rate  of  acquiring 
situational  awareness.  We  believe  that  such  an  enhanced  filter  may  benefit  from  automatically 
produced  dialog  act  information. 

1.2 Purpose of this Thesis

This thesis provides an improved method for dialog act tagging chat posts. We show that
maximum likelihood estimation part-of-speech tags nearly equal the performance of
computationally expensive, human-verified part-of-speech tags in determining dialog act tags in
the chat domain. More importantly, our methodology demonstrates that maximum likelihood
estimation  part  of  speech  tags  from  a  fundamentally  different,  labeled  domain  work  very  well 
in  the  chat  domain.  This  is  very  important,  not  just  for  analysis  of  chat,  as  it  bodes  well  for  new 
domains  of  Internet  communications  as  they  are  invented,  deployed  and  developed. 

In  fact,  this  thesis  represents  new  work  in  the  important  field  of  cross-genre  machine  learning. 
We  show  that  previous,  human-involved  investments  in  another  genre  can  be  effectively  applied 
to  produce  acceptable  results  in  the  chat  domain.  Our  work  should  serve  as  a  foundation  for 
other  research  in  the  rapidly  expanding  field  of  computer  communications. 


1.3  Organization  of  Thesis 

This  thesis  is  organized  as  follows: 


•  Chapter  1  discusses  computer-mediated  communications  and  motivation  for  this  thesis. 
We  include  a  brief  overview  of  the  chat  domain  and  specific  challenges  to  analysis  of  the 
data  found  there. 

•  Chapter  2  contains  background  information  on  Internet  Relay  Chat,  previous  research 
into  chat  analysis  and  the  machine  learning  techniques  used  in  this  work. 

•  Chapter  3  includes  our  experimental  approach  to  dialog  act  tagging  chat  posts.  This 
chapter  includes  discussions  on  the  sources  of  data  for  our  part  of  speech  tagger,  training 
and  test  data.  We  describe  a  cross-genre  methodology  (one  that  uses  data  derived  from 
a  different  domain)  that  effectively  determines  dialog  act  tags  in  the  chat  domain.  Also 
included  are  specific  details  about  feature  selection  and  our  experimental  approach. 

•  Chapter  4  provides  the  results  of  our  work  in  dialog  act  tagging  chat  posts.  We  also 
provide  statistical  significance  test  results  for  our  data.  Additionally,  we  include  results 
of experiments to verify that our results are not skewed by individual author contributions to the
chat  data. 

•  In Chapter 5, we summarize our results and provide recommendations for future work in
improving  machine  learning  approaches  to  determine  dialog  act  tags  in  chat  posts. 

CHAPTER 2:
BACKGROUND 


2.1  Online  Chat 

The  proliferation  of  computers  and  increased  Internet  availability  have  produced  new  means 
for  connecting  socially  and  professionally.  Some  of  these  new  forms  of  information  exchange, 
in  which  users  pass  typed  messages  to  one  or  more  other  users,  are  referred  to  as  computer- 
mediated  communications  [3].  Internet  Relay  Chat  (IRC)  is  one  popular  form  of  computer- 
mediated  communication. 

Chat  “rooms”  provide  a  stage  upon  which  users  can  express  thoughts  via  typewritten  messages 
called  “chat  posts”  or,  simply,  “posts.”  These  posts  are  broadcast  to  all  subscribers  logged  into 
the  respective  chat  room.  Posts  may  be  composed  at  any  time  and  are  broadcast  in  the  order 
they  are  received,  interlacing  conversations  between  distinct  users  and  general  announcements 
meant  for  all  participants. 

Previous works by Herring and Kucukyilmaz noted that the structure of chat posts differs from
that of written text and also from that of spoken language [3, 4]. Examples of specific
differences include the use of emoticons (see Appendix A), flexible grammatical rules including
punctuation and spelling, and the intentional use of misspelled words to convey emotion or
emphasis. These differences present unique challenges when analyzing higher-order
characteristics of chat posts such as classification of dialog act and semantic meaning.

2.2  Prior  and  Related  Work 

In  2006,  Lin  collected  and  preserved  over  477,000  chat  posts  from  an  Internet  chat  site.  The 
source  material  was  saved  from  chat  rooms  that  were  organized  by  user  age  groups,  and  this 
organization  was  maintained.  These  chat  rooms  were  not  limited  to  particular  topics  [5].  The 
goal  of  Lin’s  work  was  an  attempt  to  identify  any  sexual  predators  actively  participating  in  these 
chat  rooms. 

Forsyth followed Lin with a primary goal of using machine learning algorithms to apply
part-of-speech tags to chat posts, and a secondary goal of exploring potential techniques for
automatic dialog act tagging of chat posts. In the course of his work, Forsyth removed all
personally identifiable information from 10,567 chat posts sampled from different chat rooms.
This privatized subset of Lin’s work has become known as the Naval Postgraduate School (NPS)
chat corpus. Forsyth tagged the NPS chat corpus with parts-of-speech and dialog act tags using a
bootstrapping method followed by verification by humans [6]. For part-of-speech tagging, he used the


Tag  | Description                              | Tag  | Description
CC   | Coordinating conjunction                 | PRP$ | Possessive pronoun
CD   | Cardinal number                          | RB   | Adverb
DT   | Determiner                               | RBR  | Adverb, comparative
EX   | Existential there                        | RBS  | Adverb, superlative
FW   | Foreign word                             | RP   | Particle
IN   | Preposition or subordinating conjunction | SYM  | Symbol
JJ   | Adjective                                | TO   | to
JJR  | Adjective, comparative                   | UH   | Interjection
JJS  | Adjective, superlative                   | VB   | Verb, base form
LS   | List item marker                         | VBD  | Verb, past tense
MD   | Modal                                    | VBG  | Verb, gerund or present participle
NN   | Noun, singular or mass                   | VBN  | Verb, past participle
NNS  | Noun, plural                             | VBP  | Verb, non-3rd person singular present
NNP  | Proper noun, singular                    | VBZ  | Verb, 3rd person singular present
NNPS | Proper noun, plural                      | WDT  | Wh-determiner
PDT  | Predeterminer                            | WP   | Wh-pronoun
POS  | Possessive ending                        | WP$  | Possessive wh-pronoun
PRP  | Personal pronoun                         | WRB  | Wh-adverb

Table 2.1: Penn Treebank Tagset. From [7].


Penn  Treebank  POS  tagset  (see  Table  2.1),  and  dialog  act  tagged  the  NPS  chat  corpus  using 
Wu  et  al.’s  15  post  act  categories  (see  Table  2.2).  Forsyth  compared  the  performance  of  taggers 
based  on  n-grams,  hidden  Markov  models  (both  discussed  in  the  next  section)  and  Brill  taggers 
[6].  Using  his  implementation  of  a  Brill  tagger  trained  on  the  NPS  Chat,  Wall  Street  Journal, 
and  Switchboard  corpora,  Forsyth  achieved  a  90.8%  POS  tagging  accuracy.  In  his  dialog  act 
tagging  effort,  Forsyth  developed  27  features  including  lexical  and  temporal  characteristics  of 
chat  posts  and  the  number  of  chat  users  participating  in  the  chat  room  of  interest  (see  Table 
2.3).  He  compared  the  performance  of  naive  Bayes  (discussed  in  the  next  section)  and  back- 
propagation  neural  networks  in  dialog  act  tagging  accuracy.  Forsyth  recorded  an  83.2%  dialog 
act  tagging  accuracy  using  a  back-propagation  neural  network  with  23  of  these  features  [6]. 
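
To make the idea of post-level features concrete, the following is a minimal sketch of how a few of the lexical features listed in Table 2.3 (f10, f12, f18, f19 and f24) might be computed for a single post. It is not Forsyth's implementation: the function name and the two word lists are hypothetical placeholders chosen for illustration, and the temporal features, which depend on neighboring posts, are omitted.

```python
# Hypothetical word lists; the lists actually used for Table 2.3 are not given here.
EMOTION_WORDS = {"lol", ";-)", ":)", "haha"}
SYSTEM_WORDS = {"JOIN", "PART", "MODE", "ACTION"}


def lexical_features(post):
    """Compute a handful of Table 2.3-style lexical features for one post."""
    words = post.split()
    return {
        # f10: total number of words in the post
        "f10_word_count": len(words),
        # f12: some word is an emotion variant such as "lol"
        "f12_has_emotion": any(w.lower() in EMOTION_WORDS for w in words),
        # f18: some word contains one or more '?'
        "f18_has_question_mark": any("?" in w for w in words),
        # f19: some word contains '!' but no '?'
        "f19_has_exclamation": any("!" in w and "?" not in w for w in words),
        # f24: some word is in all caps and is not a system word
        # (single letters such as "I" are ignored, an assumption made here)
        "f24_has_all_caps": any(
            len(w) > 1 and w.isupper() and w not in SYSTEM_WORDS for w in words
        ),
    }


features = lexical_features("are you STILL there?")
```

Each post is reduced to a small dictionary of counts and flags, which a classifier such as naive Bayes or a neural network can then consume.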

Tag             | Example                                | Count | Percent
Statement       | I’ll check after class                 | 3185  | 30.14%
System          | Tom[JADV 11.22.33.44] has left #sacbal | 2632  | 24.91%
Greet           | Hi, Tom                                | 1363  | 12.90%
Emotion         | lol                                    | 1106  | 10.47%
Yes-No-Question | Are you still there?                   | 550   | 5.20%
Wh-Question     | Where are you?                         | 533   | 5.04%
Accept          | I agree                                | 233   | 2.20%
Bye             | See you later                          | 195   | 1.85%
Emphasis        | I do believe he is right.              | 190   | 1.80%
Continuer       | And                                    | 168   | 1.59%
Reject          | I don’t think so.                      | 159   | 1.50%
Yes-Answer      | Yes, I am.                             | 108   | 1.02%
No-Answer       | No, I’m not.                           | 72    | 0.68%
Clarify         | Wrong spelling                         | 38    | 0.36%
Other           | ********                               | 35    | 0.33%

Table 2.2: 15 Post Act Classification for Chat. After [8]. Statistics from NPS Chat Corpus.


2.3  Machine  Learning  Techniques 

When  we  use  computers  to  analyze  data  derived  from  experience  and  use  this  information  to 
predict (in our case, to classify) new data, we are performing machine learning [9]. The volume
of Internet traffic, specifically IRC data, necessitates use of computers for any meaningful
attempt  at  analysis  of  the  information  being  transmitted.  Because  IRC  data  is  a  form  of  written 
communication  using  human  language,  the  analysis  of  chat  data  generalizes  to  a  form  of  natural 
language  processing  (NLP).  One  general  goal  of  NLP  is  that  of  classification,  where  we  attempt 
to  determine  some  higher-level  grouping  of  data.  Examples  of  this  effort  include  dialog  act 
tagging,  authorship  detection  and  topic  detection. 

In  the  use  of  computers  to  process  this  type  of  information,  we  identify  features  (e.g.,  words, 
parts-of-speech,  semantic  or  syntactic  structure)  from  which  to  draw  and  test  hypotheses. 


2.3.1  Features 

The  basis  for  classification  of  text  data  must  be  some  set  of  features  whose  analysis  sufficiently 
identifies  a  particular  example’s  class  as  opposed  to  non-classes.  One  common  approach  to 
feature  selection  in  natural  language  processing  is  to  use  the  lexical  items,  sentences,  phrases 
or  words,  in  documents  of  interest.  The  basis  for  probabilistic  methods  used  in  NLP  involves 

Feature | Definition | Rationale
f0  | Number of posts ago the poster last posted | Indicator for a Continuer act
f1  | Number of posts ago the poster made a spelling error | Indicator for a Clarify act
f2  | Number of posts ago that a post contained a ‘?’ but no WRB or WP POS tag | Indicator for a Yes / No Answer act
f3  | Number of posts in the future that contained a Yes or No word | Indicator for a Yes / No Question act
f4  | Number of posts ago that contained a Greet word | Indicator for a Greet act
f5  | Number of posts in the future that contained a Greet word | Indicator for a Greet act
f6  | Number of posts ago that contained a Bye word | Indicator for a Bye act
f7  | Number of posts in the future that contained a Bye word | Indicator for a Bye act
f8  | Number of posts ago that a post was a JOIN | Indicator for a Greet act
f9  | Number of posts in the future that a post is PART | Indicator for a Bye act
f10 | Total number of words in post | Longer posts may be Statements and Questions, shorter posts may be Emotions and Greets/Byes, etc.
f11 | First word is a conjunction, preposition, or ellipsis (POS tag of ‘CC,’ ‘IN,’ or ‘:’) | Indicator for a Continuer act
f12 | A word contains emotion variants such as lol, ;-), etc. | Indicator for an Emotion act
f13 | A word contains hello or variants | Indicator for a Greet act
f14 | A word contains goodbye or variants | Indicator for a Bye act
f15 | A word contains yes or variants | Indicator for Yes or Accept acts
f16 | A word contains no or variants | Indicator for No or Reject acts
f17 | A word POS tag is WRB or WP | Indicator for a Wh-Question act
f18 | A word contains one or more ‘?’ | Indicator for Wh- or Yes/No Question acts
f19 | A word contains one or more ‘!’ (but not a ‘?’) | Indicator for an Emphasis act
f20 | A word POS tag is ‘X’ | Indicator for an Other act
f21 | A word is a system command (. or ! with SYM POS tag) | Indicator for a System act
f22 | A word is a system word, e.g. JOIN, MODE, ACTION, etc. | Indicator for a System act
f23 | A word is an ‘any’ variant, e.g. ‘anyone,’ ‘n e,’ etc. | Indicator for a Yes/No Question act
f24 | A word is in all caps, but not a system word like JOIN | Indicator for an Emphasis act
f25 | A word is an ‘even’ or ‘mean’ variant | Indicator for a Clarify act
f26 | Total number of users currently in the chat room | More users may stretch out distances between adjacency pairs

Table 2.3: Initial Post Feature Set (27 Features). From [6].


counting  the  number  of  occurrences  of  selected  features  in  their  respective  classes. 

For example, given a chat post D = “he bought the purple dog,” we could compute the probability
of D as one item:

$$P(D) = \frac{\text{number of occurrences of } D}{\text{total number of posts in corpus}}$$

or, if we consider each word as a random variable, we could simplify this task by estimating the
probability of D using the chain rule for joint probability:

$$
\begin{aligned}
P(\text{he bought the purple dog}) ={}& P(\text{dog} \mid \text{he bought the purple}) \times P(\text{purple} \mid \text{he bought the}) \\
& \times P(\text{the} \mid \text{he bought}) \times P(\text{bought} \mid \text{he}) \times P(\text{he} \mid \text{start}) \times P(\text{start})
\end{aligned}
$$

But this becomes cumbersome in that it would require us to maintain all the probabilities of all
words given all observed previous words. We can simplify this further by making the assumption
that the probability of each word is dependent only on a limited number of previous words.
This is known as the Markov assumption and it is used frequently in NLP [10]. For example,
if we estimate the probability of D based on only one previous feature (or word in this
example):

$$
\begin{aligned}
P(\text{he bought the purple dog}) \approx{}& P(\text{dog} \mid \text{purple}) \times P(\text{purple} \mid \text{the}) \times P(\text{the} \mid \text{bought}) \\
& \times P(\text{bought} \mid \text{he}) \times P(\text{he} \mid \text{start}) \times P(\text{start})
\end{aligned}
$$


or, more generally, writing $f_0$ for the start symbol:

$$P(f_1 f_2 \ldots f_n) \approx P(f_0) \prod_{k=1}^{n} P(f_k \mid f_{k-1}) \qquad (2.1)$$

If  we  choose  to  estimate  the  probability  of  sentences  based  on  zero  previous  words,  we  simply 
maintain  the  probability  of  each  individual  word  and  multiply  using  the  chain  rule.  In  this  case, 
our  features  are  called  a  “bag  of  words”  since  the  order  is  not  important. 
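
As an illustration of the one-previous-word approximation, the sketch below estimates bigram probabilities by counting and multiplies them to score a post. It is a toy under stated assumptions: the two-post corpus, the `<start>` padding token and the function names are invented for the example, every post is assumed to begin with `<start>` so that P(start) = 1, and no smoothing is applied.

```python
from collections import Counter


def train_bigram_counts(posts):
    """Count unigrams and bigrams, padding each post with a <start> token."""
    unigrams, bigrams = Counter(), Counter()
    for post in posts:
        tokens = ["<start>"] + post.split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def bigram_probability(post, unigrams, bigrams):
    """Approximate P(post) as a product of P(word | previous word)."""
    tokens = ["<start>"] + post.split()
    prob = 1.0  # P(<start>) is taken to be 1
    for prev, word in zip(tokens, tokens[1:]):
        if unigrams[prev] == 0:
            return 0.0  # unseen history; smoothing would be needed in practice
        prob *= bigrams[(prev, word)] / unigrams[prev]
    return prob


corpus = ["he bought the purple dog", "he bought the dog"]
unigrams, bigrams = train_bigram_counts(corpus)
p_seen = bigram_probability("he bought the purple dog", unigrams, bigrams)
p_unseen = bigram_probability("dog bought", unigrams, bigrams)
```

In this toy corpus "the" is followed by "purple" in one of its two occurrences, so the first post scores 0.5, while a post containing a bigram never observed in training scores zero.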

We primarily use n-grams where n ∈ {1, 2, 3} indicates the number of individual data
elements included in each feature. Throughout this document, 1-grams (or items in the
aforementioned “bag of words”) are referred to as unigrams, 2-grams as bigrams (these correspond to
equation 2.1) and 3-grams as trigrams [10]. In addition to using n-grams made up of individual
words, we examine the potential of classifying posts by dialog act using part-of-speech n-grams.
For our experiments, we examined the use of 1-, 2- and 3-grams consisting of part-of-speech tags
and 1- and 2-grams of lexical items (the words themselves).
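
A minimal sketch of n-gram extraction follows. The helper name `ngrams` is our own, and the example post and its Penn Treebank tags are supplied by hand for illustration; the same function serves for words and for part-of-speech tags because it only slices a sequence.

```python
def ngrams(items, n):
    """Return the list of n-grams (as tuples) found in a sequence of items."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]


words = "he bought the purple dog".split()
pos_tags = ["PRP", "VBD", "DT", "JJ", "NN"]  # hand-assigned tags from Table 2.1

word_unigrams = ngrams(words, 1)
word_bigrams = ngrams(words, 2)
pos_trigrams = ngrams(pos_tags, 3)
```

A post shorter than n simply yields no n-grams, which matters in chat, where single-word posts such as "lol" are common.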

Parts  of  Speech  (POS) 

Because of the aforementioned relaxation of spelling, grammar and punctuation rules in chat,
automatic POS tagging in the chat domain has been the focus of other efforts [6, 11]. Traditional
POS taggers apply a variety of approaches to identify each word’s part-of-speech as determined
by the nature of the word and the context in which it is used. Many words in the English
language are appropriately tagged with different parts of speech depending on how they are
used. For example, “flies” may either be tagged as a plural noun if it is used to refer to common
insects (“He swatted the flies”) or a present tense verb when describing what an airplane does
(“An airplane flies”). This disambiguation is, in general, computationally expensive.
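
One inexpensive, context-free alternative is to give every word the tag it carried most often in labeled training data. The sketch below assumes that this most-frequent-tag rule is what a maximum likelihood estimate of a word's tag amounts to; the training pairs, the function names and the NN default for unknown words are all invented for illustration and are not taken from any tagger described here.

```python
from collections import Counter, defaultdict


def train_mle_tagger(tagged_words):
    """Map each word to the tag it carries most often in the training data."""
    counts = defaultdict(Counter)
    for word, tag in tagged_words:
        counts[word.lower()][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def tag(words, model, default="NN"):
    """Tag each word independently of context; unknown words get `default`."""
    return [(w, model.get(w.lower(), default)) for w in words]


# Toy training data: "flies" appears twice as a plural noun, once as a verb.
training = [
    ("flies", "NNS"), ("flies", "NNS"), ("flies", "VBZ"),
    ("the", "DT"), ("he", "PRP"), ("swatted", "VBD"),
]
model = train_mle_tagger(training)
tagged = tag(["An", "airplane", "flies"], model)
```

Because the lookup ignores context, "flies" receives the noun tag even in "An airplane flies." That is the price paid for avoiding the more expensive disambiguation.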

Forsyth  part-of-speech  tagged  the  anonymized  portion  of  the  NPS  chat  corpus  using  the  Penn 
Treebank system of tags (see Table 2.1). For his work, Forsyth used 27 features to compare
performance  of  naive  Bayes,  hidden  Markov  model,  and  Brill  taggers  in  the  determination  of 
chat  parts-of-speech  classification. 


Dialog  Acts 

Stolcke suggested that a useful, first level of detail in the analysis of discourse structure is dialog
act identification [12]. For example, because of chat’s aforementioned broadcast structure and
interlaced conversations, dialog acts have been shown to provide some assistance in
conversational thread extraction [11] or determination of conversation meaning [13].

2.3.2  Naive  Bayes  Classifiers 

Naive  Bayes  classifiers  are  a  form  of  supervised  learning.  This  type  of  algorithm  requires 
labeled  data  for  training.  Across  all  of  the  labeled  classes,  we  can  determine  the  probability  of 
each  dialog  act  by  counting  the  example  posts  of  each  category  and  dividing  by  the  total  number 
of  posts  used  for  training: 

$$P(C_j) = \frac{\text{number of training set examples of } C_j}{\text{total number of posts in training set}}$$

This  value  is  referred  to  as  the  “prior”  probability  of  the  class  Cj  in  the  training  set  and  it  is  an 
important  part  of  our  classifier  as  seen  below. 

We label the count of feature $f_i$ as $\mathrm{count}(f_i)$. Then the probability of $f_i$ occurring in a
dialog act class $C_j$ of $N$ words is $P(f_i) = \frac{\mathrm{count}(f_i)}{N}$. We use these probabilities in the form
of a feature vector $F = \{P(f_1), P(f_2), \ldots, P(f_n)\}$. Because we have computed these feature counts
separately for each dialog act class, the resulting probabilities are conditioned on the class,
giving us $P(F \mid C)$. However, our classification task requires us to compute $P(C \mid F)$. To do this
we apply Bayes’ Rule, which states:

$$P(C \mid F) = \frac{P(C)\,P(F \mid C)}{P(F)}$$


The task for our Naive Bayes classifier is then to find the class $C_j$ that maximizes $P(C \mid F)$. We
call the class so identified by our classifier $\hat{C}$ where:

$$\hat{C} = \operatorname*{argmax}_{C \in \text{Classes}} \frac{P(C)\,P(F \mid C)}{P(F)}$$


Note that we compare the probability of a feature vector across all the classes. Thus the
denominator, the probability of our feature vector, does not change between classes. Because the
denominator behaves as a positive constant, and division by a positive constant does not change
the relative results across classes, we can simplify the equation for our Naive Bayes classifier as:

$$\hat{C} = \operatorname*{argmax}_{C \in \text{Classes}} P(C)\,P(F \mid C)$$

A critical assumption made in the use of Naive Bayes classifiers is that, given the class, each
feature in the feature vector is independent of every other feature. This assumption means that:

$$P(F \mid C) = \prod_{f_k \in F} P(f_k \mid C)$$

and our final equation for the Naive Bayes classifier becomes:

$$\hat{C} = \operatorname*{argmax}_{C \in \text{Classes}} P(C) \prod_{k=1}^{n} P(f_k \mid C)$$

Note  that  the  first  term  in  the  equation  is  our  class  prior  as  discussed  above. 

One  limitation  of  digital  computers  arises  here.  Note  that  because  we  are  potentially  multiplying 
many  probabilities,  and  probabilities  are  less  than  or  equal  to  one,  we  may  rapidly  generate 
a  number  that  is  too  small  for  a  computer  to  represent.  Hence  it  is  common  to  map  these 
probabilities  via  logarithms  and,  exploiting  the  properties  of  logarithms,  we  add  these  log- 
probabilities  instead  of  multiplying  the  actual  probabilities.  Our  equation  becomes: 


$$\hat{C} = \operatorname*{argmax}_{C \in \text{Classes}} \left[ \log P(C) + \sum_{k=1}^{n} \log P(f_k \mid C) \right]$$
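
The underflow problem is easy to reproduce. The short sketch below multiplies 200 probabilities of 0.01 each (made-up values chosen only to force the effect) and compares the result with the sum of their logarithms.

```python
import math

probs = [0.01] * 200  # 200 hypothetical feature probabilities

product = 1.0
for p in probs:
    product *= p  # falls below the smallest representable double and becomes 0.0

log_sum = sum(math.log(p) for p in probs)  # stays finite: 200 * log(0.01)
```

The true product, 10^-400, is far below what a double-precision float can hold, so `product` ends up exactly zero and every class would tie. The log-sum remains an ordinary negative number that can still be compared across classes.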

2.3.3 Smoothing

One issue we must address when applying naive Bayes as above is the probability of
encountering features not seen during training (or “unseen” events). If we simply assign
these new features a probability of zero, our product rule would produce zero for an entire case
when encountering an unseen event. Similarly, since the logarithm of zero is undefined, our
summation including the log of zero is undefined.

In  order  to  account  for  the  possibility  that  we  will  encounter  events  unseen  in  training,  we 
implement  techniques  that  assign  some  minute  probability  to  these  features.  This  process  is 
called  “smoothing.”  Because  we  are  dealing  with  probabilities  and  they  must  sum  to  1,  the  idea 
in  smoothing  techniques  is  to  take  a  small  amount  of  probability  mass  from  the  features  we  have 
seen  and  give  it  to  the  features  we  have  not  seen  [14]. 

Add-One  Smoothing  (Also  Known  as  “Laplace  Smoothing”) 

This  method  introduces  some  variability  in  the  science  of  our  data.  Add-One  smoothing,  as  the 
name  implies,  adds  one  to  every  count.  The  features  we  have  seen  are  treated  as  if  they  have 
been  seen  one  additional  time  and  unseen  features  (that  had  zero  counts  in  training)  are  given  a 
value  as  if  we  had  seen  each  one  time.  Typically,  we  define  Add-One  Smoothing  in  terms  of: 

T  :  the  number  of  unique  types  we  have  observed 
N  :  the  total  number  of  tokens  we  have  observed 
V  :  the  size  of  the  vocabulary 

Z  :  the  number  of  types  we  have  not  seen  (Z  =  V  —  T) 

Because we added one to the count of every feature in the vocabulary, our total count must now
be N + V to make room in the total probability for all our features. We denote the smoothed
probability as P*. Our smoothed probability of feature $f_i$ now becomes
$P^*(f_i) = \frac{\mathrm{count}(f_i) + 1}{N + V}$ and we assign all unseen features the probability
$\frac{1}{N + V}$ [14].
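
Putting the class prior, the log-probability sum and Add-One smoothing together gives a complete, if tiny, classifier. The following is a sketch under stated assumptions rather than the implementation used in this work: the function names and the four training posts are invented, features are simply the words of each post, and the vocabulary size V is taken to be the number of distinct features seen in training plus one slot reserved for anything never seen.

```python
import math
from collections import Counter


def train(labeled_posts):
    """labeled_posts: list of (list_of_features, class_label) pairs."""
    class_counts = Counter()
    feature_counts = {}
    vocabulary = set()
    for features, label in labeled_posts:
        class_counts[label] += 1
        feature_counts.setdefault(label, Counter()).update(features)
        vocabulary.update(features)
    return class_counts, feature_counts, vocabulary


def classify(features, class_counts, feature_counts, vocabulary):
    """Return the class maximizing log P(C) + sum of log P*(f_k | C)."""
    total_posts = sum(class_counts.values())
    v = len(vocabulary) + 1  # assumed V: seen types plus one slot for unseen
    best_label, best_score = None, -math.inf
    for label, n_posts in class_counts.items():
        counts = feature_counts[label]
        n = sum(counts.values())                 # N: tokens seen in this class
        score = math.log(n_posts / total_posts)  # log prior
        for f in features:
            score += math.log((counts[f] + 1) / (n + v))  # Add-One estimate
        if score > best_score:
            best_label, best_score = label, score
    return best_label


training = [
    (["hi", "tom"], "Greet"),
    (["hello", "all"], "Greet"),
    (["where", "are", "you", "?"], "Wh-Question"),
    (["what", "is", "that", "?"], "Wh-Question"),
]
model = train(training)
greet_guess = classify(["hi", "everyone"], *model)
question_guess = classify(["where", "?"], *model)
```

The word "everyone" never appears in training, yet the first post is still classified because smoothing gives the unseen feature a small non-zero probability in every class.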

Witten-Bell  Smoothing 

This type of smoothing uses a frequentist approach in an attempt to capture an estimate of the
probability of seeing a feature for the first time. Using the same notation as above, the sum
of the probabilities of seeing features for the first time is assigned as $\frac{T}{N+T}$. As above, Z is
the number of vocabulary words we have not seen (and thus have no probability data for). Then
each unseen word will be assigned $\frac{1}{Z}$ of the total value, or $\frac{T}{Z(N+T)}$. Using Witten-Bell
smoothing, for features we have seen we use $\frac{\mathrm{count}(f_i)}{N+T}$ for the probability of feature $f_i$ [15].
Succinctly:

$$
P^*(f_i) =
\begin{cases}
\dfrac{T}{Z(N+T)} & \text{if } \mathrm{count}(f_i) = 0 \\[2ex]
\dfrac{\mathrm{count}(f_i)}{N+T} & \text{if } \mathrm{count}(f_i) > 0
\end{cases}
$$

2.3.4  Hidden  Markov  Models 

Hidden Markov models (HMMs) are frequently used in part-of-speech tagging, with the hidden
states corresponding to the respective POS tags. We do not use HMMs for POS tagging here.
Instead we implemented HMMs in order to determine if they could provide useful information in
dialog act classification. HMMs are discussed in [9, 10, 14, 16].


Baum-Welch Algorithm (input training sequence O, output HMM H = (π, A, B, N, M))

Goal: Iteratively estimate model parameters A, B, π.

Define: $p_t(i,j)$, $1 \le t \le T$, $1 \le i,j \le N$, where T is the length of O and N is the number of
hidden states.

Step 1: Let the initial model be $\mu_0 = (A_0, B_0, \pi_0)$.

Step 2: Using the current model $\mu = \mu_0$, compute

$$p_t(i,j) = \frac{P(Q_t = i,\ Q_{t+1} = j,\ O \mid \mu)}{P(O \mid \mu)} = \frac{\alpha_i(t)\, a_{ij}\, b_{ij o_t}\, \beta_j(t+1)}{P(O \mid \mu)}$$

Define:

$\gamma_i(t) = \sum_{j=1}^{N} p_t(i,j)$, the probability of being in state i at time t given O.

$\sum_{t=1}^{T} \gamma_i(t)$ = expected number of transitions from state i in O

$\sum_{t=1}^{T} p_t(i,j)$ = expected number of transitions from state i to j in O

$\hat{\pi}_i = \gamma_i(1)$ = expected frequency in state i at time t = 1

$$\hat{a}_{ij} = \frac{\text{expected number of transitions from state } i \text{ to state } j}{\text{expected number of transitions from } i}$$

$$\hat{b}_{ijk} = \frac{\text{expected number of transitions from } i \text{ to } j \text{ with observed token } k}{\text{expected number of transitions from } i \text{ to } j}$$

If $\log P(O \mid \hat{\mu}) - \log P(O \mid \mu_0) < \epsilon$, return $\hat{\mu}$;
else set $\mu_0 = \hat{\mu}$ and go to Step 2.

Figure 2.1: Baum-Welch Algorithm. After [16].

An HMM H consists of a five-tuple H = {Π, A, B, N, M} where Π represents the probabilities
for each initial state, A is the set of state transition probabilities, B is the set of emission
probabilities, N is a set of hidden states, and M is the symbol alphabet.

H  is  trained  by  use  of  a  sequence  of  tokens  derived  from  all  tokens  observed  during  training. 

The parameters of the language model μ = {Π, A, B} are learned through a form of expectation
maximization. In this methodology, we begin with an initial estimate of the parameters. In the
expectation step, we use the current parameters to compute the expected number of times each
transition and emission is used in producing the training sequence. In the maximization step, we
use these expected counts to produce new parameters for the model, which do not decrease the
likelihood of the training sequence. By iteratively improving μ’s parameters, we improve the
overall performance of the HMM until the magnitude of the change in log-likelihood falls below
a defined threshold. For HMMs, this iterative algorithm is called the Baum-Welch or
Forward-Backward algorithm (see Figure 2.1) [16].

Though subject to settling at local maxima, the expectation maximization approach has proven effective for use in the training of Hidden Markov Models. When the language model has been determined through training, the HMM uses the calculated μ and processes observation sequences (O, where O = (o_1, o_2, ..., o_n) and o_k ∈ M) derived from test cases (for this work these consist of individual chat posts). The Viterbi algorithm, a form of dynamic programming, is then used to determine the probability of observing a respective test case given H.

2.3.5  Support  Vector  Machines 

Support  Vector  Machines  (SVMs)  are  discussed  in  [9,  10,  14].  In  general,  SVMs  produce  a 
discriminant  classifier  that  attempts  to  find  boundaries  that  separate  two  distinct  classes  of  data. 
Because  we  may  have  an  infinite  number  of  boundaries  that  satisfy  this  requirement  (see  Figure 
2.3),  SVM  further  refines  the  solution  to  the  boundary  that  maximizes  the  distance  between  the 
data  points  closest  to  the  proposed  boundary  (see  Figure  2.4).  Hence,  SVM  is  also  referred  to 
as  a  maximum  margin  classifier.  Note  that  the  data  points  closest  to  the  boundary,  those  whose 
margin  we  are  maximizing,  are  called  the  support  vectors.  The  boundaries  produced  by  SVM 
classifiers  are  of  dimension  n  —  1  where  n  corresponds  to  the  dimensions  of  the  data  points 
themselves.  Therefore,  for  two  dimensional  space,  SVM  attempts  to  find  a  line  separating  the 
class  examples  from  the  non-class  examples,  and  for  three  dimensional  data,  the  algorithm 
attempts  to  find  a  boundary  in  the  form  of  a  plane.  Above  three  dimensions,  SVM  boundaries 
are  called  hyperplanes. 
Viterbi Algorithm (O, HMM μ = (Π, A, B, N, M))

Goal: Find the most probable state sequence $\hat{X} = \arg\max_{X} P(X \mid O, \mu)$

It is sufficient to maximize $\arg\max_{X} P(X, O \mid \mu)$

Define:

$\delta_j(t) = \max_{X_1 \cdots X_{t-1}} P(X_1 \cdots X_{t-1}, o_1 \cdots o_{t-1}, X_t = j \mid \mu)$

Step 1: Initialization

$\delta_j(1) = \pi_j, \quad 1 \le j \le N$

Step 2: Induction

$\delta_j(t+1) = \max_{1 \le i \le N} \delta_i(t)\, a_{ij}\, b_{ij o_t}, \quad 1 \le j \le N$

Store backtrace

$\psi_j(t+1) = \arg\max_{1 \le i \le N} \delta_i(t)\, a_{ij}\, b_{ij o_t}, \quad 1 \le j \le N$

Step 3: Termination and path readout (by backtracking). The most likely state sequence is worked out from the right backwards:

$\hat{X}_{T+1} = \arg\max_{1 \le i \le N} \delta_i(T+1)$

$\hat{X}_t = \psi_{\hat{X}_{t+1}}(t+1)$

$P(\hat{X}) = \max_{1 \le i \le N} \delta_i(T+1)$


Figure 2.2: Viterbi Algorithm. After [16].
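
As a concrete illustration of Figure 2.2, the induction and backtrace steps can be sketched in Python. This is a minimal sketch rather than the implementation used in this work: the function name viterbi_arc_emission and the nested-list layout of pi, a and b are assumptions introduced for illustration. It follows the arc-emission form of the figure, in which b[i][j][k] is the probability of emitting symbol k on the transition from state i to state j.

```python
def viterbi_arc_emission(obs, pi, a, b):
    """Return (state_path, probability) for an arc-emission HMM.

    obs       : list of symbol indices o_1 .. o_T
    pi[j]     : probability of starting in state j
    a[i][j]   : probability of moving from state i to state j
    b[i][j][k]: probability of emitting symbol k on the arc i -> j
    The returned path has T + 1 states (X_1 .. X_{T+1}).
    """
    n = len(pi)
    delta = list(pi)      # Step 1: delta_j(1) = pi_j
    backtrace = []        # psi tables, one per observation

    # Step 2: induction over the observations
    for o in obs:
        new_delta = [0.0] * n
        psi = [0] * n
        for j in range(n):
            best_i = max(range(n),
                         key=lambda i: delta[i] * a[i][j] * b[i][j][o])
            psi[j] = best_i
            new_delta[j] = delta[best_i] * a[best_i][j] * b[best_i][j][o]
        backtrace.append(psi)
        delta = new_delta

    # Step 3: termination and path readout from the right backwards
    last = max(range(n), key=lambda i: delta[i])
    path = [last]
    for psi in reversed(backtrace):
        path.append(psi[path[-1]])
    path.reverse()
    return path, delta[last]
```

With two states, pi = [1.0, 0.0], uniform transition probabilities of 0.5, and emissions that favor symbol k (probability 0.9) when entering state k, the observation sequence [0, 1, 1] yields the state path [0, 0, 1, 1].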


If  the  data  is  not  linearly  separable,  support  vector  machines  may  apply  a  kernel  function  to  the 
data  points.  This  results  in  added  dimensionality  of  the  resulting  data  and  may  provide  linearly 
separable  points  in  the  new  feature  space  [18]. 

2.3.6  Decision  Trees 

Classification  using  decision  trees  is  discussed  in  [9,  14,  19,  20,  21].  This  classification  method 
uses  successive  questions  about  dataset  attributes  to  reduce  the  possible  selections  for  our  clas¬ 
sifier  until  a  determination  is  achieved.  At  the  root  of  the  tree,  all  classes  are  considered  possible 
and  a  question  is  asked  regarding  the  data  features.  For  a  binary  decision  tree,  this  is  a  yes  or 
no  question  whose  answer  leads  to  another  node  with  a  subsequent  question.  Ideally,  when  a 
node’s question determines the class of a test case, the answer leads to a leaf node that returns the classifier’s result.

Figure 2.3: Example Separators. From [17].

Rather  than  constructing  a  decision  tree  by  randomly  selecting  questions,  Quinlan  advocates 
a  method  of  inductively  creating  decision  trees  based  on  using  a  measure  of  maximum  infor¬ 
mation  gain  [22].  Called  the  ID3  algorithm,  it  starts  at  the  root  of  the  tree  and  develops  from 
the  top  down  recursively.  If,  at  a  node,  the  data  belongs  to  only  one  subset,  the  tree  classifies 
test  data  leading  to  this  node  as  belonging  to  that  subset.  If  questions  are  available  to  divide 
the  subset  further,  the  question  providing  the  highest  information  gain  is  selected  and  the  new 
subsets  become  nodes  on  the  next  lower  level.  If  there  are  no  questions  that  further  segregate 
the data, the node becomes a leaf and classifies any examples that lead to this leaf as belonging
to  the  most  likely  class  included  in  the  remaining  subsets. 
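
The information gain criterion that drives question selection can be sketched as follows. This is a minimal sketch under stated assumptions: the helper names (entropy, information_gain, best_question) and the representation of examples as dictionaries of attribute values are hypothetical and chosen only for illustration; the code covers the selection of a single question, not the full recursive ID3 procedure.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(rows, labels, attribute):
    """Reduction in entropy obtained by splitting on one attribute.

    rows   : list of dicts mapping attribute name -> value
    labels : list of class labels, parallel to rows
    """
    total = len(labels)
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attribute], []).append(label)
    # Weighted entropy of the subsets produced by the question
    remainder = sum(len(p) / total * entropy(p) for p in partitions.values())
    return entropy(labels) - remainder


def best_question(rows, labels, attributes):
    """Pick the attribute with the highest information gain."""
    return max(attributes, key=lambda a: information_gain(rows, labels, a))
```

A question that separates the classes perfectly recovers the full entropy of the node, while a question whose answers are unrelated to the class yields a gain of zero and would not be selected.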

Decision  trees  are  hampered  by  several  issues  including  overfitting  and  “...handling  continu¬ 
ous  attributes,  choosing  an  appropriate  attribute  selection  measure,  handling  training  data  with 
missing  attribute  values,  handling  attributes  with  differing  costs,  and  improving  computational 
efficiency” [20]. To address some of these issues, Quinlan modified ID3 by using reduced-error
pruning.  This  method  considers  each  node  of  the  tree  and  if  removal  of  the  node  does  not  reduce 
the  performance  of  the  tree  when  validation  data  is  tested,  it  is  removed  and  a  leaf  node  that 
returns  the  most  likely  of  the  classes  remaining  is  installed.  Note  that  this  requires  the  training 
data  to  be  divided  into  a  training  set  and  a  validation  set  which  is  not  desirable  for  small  training 
sets. 

Figure  2.4:  Maximum  Margin  Hyperplane.  From  [17]. 


In 1993, Quinlan introduced the C4.5 algorithm as an extension of ID3. C4.5 builds the tree as ID3 does but refines the resulting tree by creating a rule for each path in the tree, generalizing each rule if possible, then sorting the rules by comparing their estimated accuracy. In his estimate
of  rule  accuracy  in  C4.5,  Quinlan  uses  the  training  set  to  determine  each  rule’s  accuracy  and 
applies  a  penalty  to  better  estimate  test  performance  [20].  The  test  data  is  then  classified  using 
these  rules. 

2.3.7  Maximum  Entropy 

Application of Maximum Entropy techniques in NLP is discussed in [23, 24]. These techniques are based on making no arbitrary assumptions about the data to be classified. Given no information about a data set with N classes, in order to avoid making undue assumptions, we would require that the probability of an element x belonging to class c_j is uniformly distributed across all classes, thus p(c_j|x) = 1/N where 1 ≤ j ≤ N. If we discover some piece of evidence during training that would indicate that x is more likely to belong to a subset of the classes, then the probabilities of the classes belonging to this subset are promoted [23]. The probabilities of the classes not in this subset are subsequently reduced in order to maintain total probability equal to one. These models continue to be updated throughout training.

To develop a model, these techniques are used to develop “features” which consist of binary functions based on observations made during training. These functions, with respect to the class distributions discovered during training, are then used in classification. These statistics, when determined important to the classification task, are then used as constraints to which prospective models must adhere. Those models that violate a constraint are discarded from consideration [23].

Consider the NPS chat corpus domain where we have 15 distinct dialog act classes. With no other information, by the principle of maximum entropy, we would assume a uniform distribution and assign the probability of a particular post belonging to each of our categories as p(c) = 1/15 = 0.067. If, during training, we discover that half the time we observe the word “how” in a chat post the post belongs to the whQuestion or ynQuestion classes, then we would update our model. Because we have no other information between the two Question classes, we evenly distribute the update across them giving

p(whQuestion|“how”)  =  p(ynQuestion|“how”)  =  0.25 


and 


p(each other class | “how”) = 0.5 / 13 = 0.0385

By  repeatedly  comparing  a  test  case’s  data  with  multiple  constraints,  the  classifier  predicts  to 
which  class  the  test  case  belongs. 
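
The arithmetic of this example can be checked with a short sketch. It only redistributes probability mass as described in the text; it is not a maximum entropy trainer, and the function name maxent_update is a hypothetical name introduced for illustration. The class names are taken from the dialog act classes of the NPS chat corpus.

```python
DIALOG_ACTS = [
    "Statement", "System", "Greet", "Emotion", "ynQuestion",
    "whQuestion", "Accept", "Bye", "Emphasis", "Continuer",
    "Reject", "yAnswer", "nAnswer", "Clarify", "Other",
]


def maxent_update(classes, favored, favored_mass):
    """Spread favored_mass evenly over the favored classes and the
    remaining mass evenly over all other classes."""
    others = [c for c in classes if c not in favored]
    probs = {c: favored_mass / len(favored) for c in favored}
    probs.update({c: (1.0 - favored_mass) / len(others) for c in others})
    return probs


# Uniform prior before any evidence is seen
prior = {c: 1.0 / len(DIALOG_ACTS) for c in DIALOG_ACTS}

# Evidence: "how" indicates a Question class half of the time
given_how = maxent_update(DIALOG_ACTS, ["whQuestion", "ynQuestion"], 0.5)
```

Each Question class receives 0.5 / 2 = 0.25, and each of the 13 remaining classes receives 0.5 / 13, so the distribution still sums to one.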


2.3.8  Evaluation  Criteria 

Accuracy 

Accuracy  is  a  frequently  used  metric  for  comparing  the  performance  of  classifiers.  Accuracy 
reports  the  percentage  of  items  classified  correctly.  The  formula  for  accuracy  is: 


$$\text{Accuracy} = \frac{TruePositives + TrueNegatives}{TruePositives + FalsePositives + TrueNegatives + FalseNegatives} \quad (2.2)$$

Where:

True Positives: the number of posts in the class of interest that were correctly classified
False Positives: the number of posts incorrectly called members of the class of interest
True Negatives: the number of posts correctly classified as non-class
False Negatives: the number of posts that were members of the class of interest but that were incorrectly classified as non-class


For  our  work,  Accuracy  is  the  number  of  chat  posts  our  classifier  correctly  labeled  divided  by 
the total number of chat posts in the test set.


Precision,  Recall  and  F-score 

Precision is the proportion of items that a classifier correctly labeled as class c_i versus the total number of items it classified as c_i. In essence, precision is a measure of how reliable the output of a classification scheme is. The precision formula is:

$$\text{Precision} = \frac{TruePositives}{TruePositives + FalsePositives}$$


Consider, however, that if our classifier selects one correct example out of many (TruePositives = 1), but selects no others (FalsePositives = 0), we would achieve a precision of 1.00. Clearly, precision alone is an insufficient measure of performance. Recall is the proportion of items that a classifier correctly labeled as class c_i versus the total number of examples of c_i in the testing set. The formula for recall is:

$$\text{Recall} = \frac{TruePositives}{TruePositives + FalseNegatives}$$


Similar  to  precision,  Recall  has  a  shortcoming  in  that  if  we  select  everything,  we  can  achieve 
a  recall  of  1.00  because  we  have  classified  no  false  negatives.  Because  algorithmic  approaches 
may  be  biased  in  favor  of  either  precision  or  recall,  and  these  biases  frequently  sacrifice  one  for 
the  other,  we  provide  an  F-score  for  our  results  [16].  The  F-score  is  a  harmonic  mean  and  is 
given  by  the  formula: 

$$\text{F-score} = \frac{2}{\frac{1}{Precision} + \frac{1}{Recall}}$$


Confusion  Matrices 

While Accuracy, Precision, Recall and F-score provide a high-level indication of a classifier’s performance, they provide no utility in determining where the classifier erred. A confusion matrix can be useful in error analysis by displaying truth information in columns and classifier results in rows. Cell (x, y) then represents the number of items in class y that our classifier labeled as x [10]. A confusion matrix is then:
                      TRUTH
                +                   -
LABELS   +   True Positives      False Positives
         -   False Negatives     True Negatives

Table 2.4: Example Confusion Matrix


Note that the cell entries in Table 2.4 directly correspond to the terms used in Accuracy, Precision and Recall above.

Consider an example binary classification task performed on a set consisting of 100 test cases with 10 belonging to class c1 and 90 belonging to class c2. If our classifier correctly labels 5 cases that belong to c1 and mislabels no cases belonging to c2, our confusion matrix would be:


                 TRUTH
               c1     c2
LABELS   c1     5      0
         c2     5     90

Table 2.5: Confusion Matrix with Sample Data


We have 5 correctly labeled examples as shown in cell (c1, c1). In other terms, we have True Positives = 5. Cell (c1, c2) shows that we did not mislabel any examples of c2 as belonging to class c1 (False Positives = 0) and cell (c2, c1) indicates that we have mislabeled 5 c1 cases as not belonging to c1, or False Negatives = 5. Finally, cell (c2, c2) shows that we correctly identified all c2 cases (True Negatives = 90).

From our formulas above:


$$\text{Precision} = \frac{5}{5 + 0} = 1.00$$

$$\text{Recall} = \frac{5}{5 + 5} = 0.50$$

$$\text{F-score} = \frac{2}{\frac{1}{1.0} + \frac{1}{0.5}} = 0.667$$
Additionally, we can see a shortcoming of using Accuracy as a measure of performance when there are many non-examples in a test set. In this case, Accuracy = (5 + 90) / 100 = 0.95. While a measure of 95% seems satisfactory, it obfuscates the fact that our classifier missed half of the example cases we may have been interested in.
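
The four measures can be computed directly from the cells of a binary confusion matrix. The following is a minimal sketch; the function name classification_metrics and the convention of returning 0.0 when a denominator is zero are assumptions made for illustration, not part of the definitions above.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F-score from confusion-matrix cells."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    # Guard against empty denominators (assumed convention: report 0.0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision and recall:
        f_score = 2 / (1 / precision + 1 / recall)  # harmonic mean
    else:
        f_score = 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_score": f_score,
    }


# Cells of the sample confusion matrix discussed in the text
sample = classification_metrics(tp=5, fp=0, fn=5, tn=90)
```

For the sample data, the high accuracy and perfect precision coexist with a recall of only 0.5, which the F-score of roughly 0.667 reflects.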

One’s  choice  in  evaluation  criteria  is  clearly  important  in  determining  the  true  performance  of 
any  classifier. 
CHAPTER 3:
TECHNICAL  APPROACH 


3.1  Introduction 

Part of speech tagging is useful in dialog act tagging as shown in Forsyth [6] and Wu et al. [8]. Unfortunately, at the current state of the art, accurate grammatical tagging requires hand-annotation in the chat domain. We hypothesize that by using MLE part-of-speech tags, similar dialog act tagging performance is achievable with significantly less effort vis-a-vis hand POS tagging.

In  this  chapter,  we  describe  the  data  sources  and  experimental  design. 

3.2  Sources  of  Data 

We  elected  to  generate  our  MLE  part  of  speech  tags  from  a  domain  outside  of  chat  in  order  to 
test  the  viability  of  our  cross-genre  approach. 

3.2.1  Wall  Street  Journal  and  Brown  Corpora 

In order to produce a cross-genre, maximum likelihood estimation (MLE) part of speech tagger, we counted the number of words and their corresponding parts of speech in the Wall Street Journal and Brown corpora. For the MLE tags, we applied to each word in the CPOS dictionary the part of speech that had the highest count in the combined corpora. We refer to these tags as “cheap” part-of-speech (or “CPOS”) tags. Because the tag set used in the Brown corpus was larger, we mapped some of the Brown tags to their Wall Street Journal equivalents. In addition, all words in the CPOS dictionary were converted to lower case.

To  reduce  the  size  of  the  CPOS  dictionary,  tokens  that  consisted  of  cardinal  numbers  (POS 
tagged  as  “CD”)  were  removed  and  later  recognized  by  regular  expressions.  Our  methodology 
resulted  in  a  dictionary  with  74,034  entries.  Note  that  we  did  not  use  any  chat  corpus  data  in 
creating  this  dictionary. 
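
The construction of the CPOS dictionary can be sketched as follows. This is a minimal sketch under stated assumptions: the function names, the (word, tag) input format, and the CARDINAL regular expression are hypothetical stand-ins for illustration and are not the code or the exact expressions used in this work. Words absent from the dictionary receive the “UNK” tag.

```python
import re
from collections import Counter, defaultdict

# Hypothetical pattern for cardinal numbers such as "42" or "1,000.5"
CARDINAL = re.compile(r"^[+-]?\d[\d,]*(\.\d+)?$")


def build_cpos_dictionary(tagged_words):
    """Map each lower-cased word to its most frequent POS tag.

    tagged_words: iterable of (word, tag) pairs from the tagged corpora.
    Tokens tagged "CD" are left out and handled by CARDINAL instead.
    """
    counts = defaultdict(Counter)
    for word, tag in tagged_words:
        if tag == "CD":
            continue
        counts[word.lower()][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def cheap_tag(tokens, cpos):
    """Apply MLE tags to a list of tokens; unknown words get "UNK"."""
    tagged = []
    for token in tokens:
        lowered = token.lower()
        if CARDINAL.match(lowered):
            tagged.append((token, "CD"))
        else:
            tagged.append((token, cpos.get(lowered, "UNK")))
    return tagged
```

Because the lookup ignores context, a word such as “that” always receives its single most frequent tag, which is the source of the tagging differences discussed later in this chapter.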

3.2.2  NPS  Chat  Corpus 

The  chat  data  originally  collected  by  Lin  in  2006  is  described  in  Lin  [5].  She  collected  over 
477,000  individual  posts  by  3,290  unique  authors.  A  portion  of  this  corpus  was  anonymized  by 
Forsyth who masked personally identifiable information such as names and ages. Users’ chat
aliases  were  replaced  with  templates  assigned  based  on  chat  room  (including  age  group),  date 
and  order  that  each  user  joined  the  respective  chat  room. 

Forsyth part-of-speech and dialog act tagged the anonymized portion of the Lin corpus consisting of 10,567 chat posts [6]. This subset is known as the NPS chat corpus. We considered his tags, both POS and dialog act, as “ground truth” and compared the performance of our dialog act classifier based on his parts of speech and our cheap parts of speech. Table 3.1 shows the breakdown of posts by dialog act class in the entire NPS chat corpus. Note the disparities in the sizes of the different dialog act classes as shown in column two. Naive Bayes classifiers use class priors (P(C)). These are displayed in column three. The large differences in class priors will significantly skew our classifier results toward the Statement and System dialog act classes.

Dialog Act    Post Count    Percent of Total
Statement           3185              30.14%
System              2632              24.91%
Greet               1363              12.90%
Emotion             1106              10.47%
ynQuestion           550               5.20%
whQuestion           533               5.04%
Accept               233               2.20%
Bye                  195               1.85%
Emphasis             190               1.80%
Continuer            168               1.59%
Reject               159               1.50%
yAnswer              108               1.02%
nAnswer               72               0.68%
Clarify               38               0.36%
Other                 35               0.33%

Table 3.1: Number of Posts in NPS Chat Corpus by Dialog Act

3.2.3  Division  of  Data 

In order to directly compare our classifier results with Forsyth’s, we considered each chat post independently and held out ten percent of the posts for testing. This test set was not used in training. Actual dialog act tags were maintained in the test set data in order to determine classifier performance.

We tested over 50 such divisions. This resulted in an average of 9,513.34 posts (90.02% of total) for training and 1,053.66 posts (9.98%) for testing.
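
One way to produce such a division is sketched below. The sketch assumes that each post is assigned to the test set independently with probability 0.10, which would be consistent with test set sizes that vary from trial to trial; the exact procedure is not stated here, and the function name split_posts and its seed parameter are hypothetical.

```python
import random


def split_posts(posts, test_fraction=0.10, seed=None):
    """Randomly divide posts into (train, test) lists.

    Each post is held out independently with probability test_fraction,
    so the size of the test set varies from one trial to the next.
    """
    rng = random.Random(seed)
    train, test = [], []
    for post in posts:
        if rng.random() < test_fraction:
            test.append(post)
        else:
            train.append(post)
    return train, test


# Fifty independent divisions of a corpus of 10,567 posts
corpus = list(range(10567))  # stand-in identifiers for the posts
trials = [split_posts(corpus, seed=trial) for trial in range(50)]
```

Fixing the seed per trial makes each division reproducible, so the same training and test posts can be reused when comparing actual and cheap POS features.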

3.3  Classification  Tasks 

Our  task  was  to  determine  the  effectiveness  of  cheap  parts  of  speech  in  determining  dialog  act 
class  by  use  of  a  naive  Bayes  classifier.  We  performed  a  multi-class  classification  task  over 
the 15 dialog act classes. Our results contain a comparison of performance between computationally expensive techniques with human verification to determine accurate POS tags versus “cheap” POS tags.

3.4  Feature  Selection 

Rather  than  repeating  Forsyth’s  approach  of  using  temporal  and  specific  lexical  features  of 
the  data  (see  Table  2.3),  we  elected  to  use  a  more  traditional,  token-based  approach  for  our 
naive  Bayes  classifier.  We  used  unigrams,  bigrams  and  trigrams  from  POS  tags  only  as  well  as 
bigrams  made  up  of  pairs  of  word/POS  pairs. 

3.4.1  Features 

Naive  Bayes  Classifier  Features: 

1.  Actual  Part  of  Speech  unigrams,  bigrams,  trigrams  (for  comparison) 

2.  Cheap  Part  of  Speech  unigrams,  bigrams,  trigrams 

3.  Word,  Actual  POS  pair  bigrams  (for  comparison) 

4.  Word,  Cheap  POS  pair  bigrams 

5.  Word  Bigrams 

Figures  3.1,  3.2  and  3.3  show  the  total  counts  of  features  in  all  10,567  posts  in  the  NPS  chat 
corpus.  We  observe  that  the  number  of  training  features  for  each  dialog  act  class  is  skewed 
toward  Statement  and  System  classes.  Though  there  are  more  posts  tagged  as  Emotion  than 
either  of  the  Question  classes,  the  count  of  features  in  the  Emotion  dialog  act  class  is  lower.  We 
can  infer  that  posts  in  the  Emotion  class  are  generally  shorter  than  Question  posts. 

3.5  Experiment  Setup 

For our experiments, we read in all 10,567 posts in the NPS chat corpus. Training posts were segregated into two data sets, one of which retained actual POS tags and one that replaced these with CPOS tags. These two structures produced the feature vectors used for testing. Test posts were similarly separated into two data structures, one retaining the actual POS tags, the other utilizing CPOS tags. Feature vectors were calculated for each individual post in the test data structures.

Figure 3.1: Number of Unigram Features by Dialog Act

We  chose  to  use  naive  Bayes  classifiers  with  our  different  features  due  to  their  speed. 

Each POS tag was associated with an integer that functioned as an index into arrays that maintained the feature counts.

For each test post, we computed:

$$\hat{C} = \arg\max_{C_i \in Classes} \left[ \log P(C_i) + \sum_{j} \log P^*(f_j \mid C_i) \right]$$

Noting the disparity between the class populations, we expected that the class prior probabilities would affect the performance of a naive Bayes classifier. In order to help overcome this disparity, we also computed:

$$\hat{C} = \arg\max_{C_i \in Classes} \left[ \log P(C_i) + \sum_{j} \log P^*(wpp_j \mid C_i) + \sum_{k} \log P^*(pb_k \mid C_i) \right] \quad (3.1)$$

where $wpp_j$ is the word/POS pair bigram $j$ and $pb_k$ is the POS bigram $k$.

Figure 3.2: Number of Bigram Features by Dialog Act

Overall Accuracy was computed as the number of correctly labeled test posts divided by the number of test posts, both for comparison with Forsyth and because the large number of True Negatives skews the per-class accuracy calculations (as shown in equation 2.2) toward 1.00 so as to make them useless.

We noted that Witten-Bell smoothing performed better than LaPlace for our experiments. We provide results for Witten-Bell smoothed unigrams, bigrams and trigrams and LaPlace smoothing of bigrams for comparison.
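
The log-space decision rule used by a naive Bayes classifier can be sketched as follows. This is a minimal sketch, not the code used in this work: the class name NaiveBayes and its methods are hypothetical, and for brevity the sketch uses add-one (LaPlace) smoothing rather than the Witten-Bell smoothing used for most of our results.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multi-class naive Bayes over hashable features (e.g. POS bigrams)."""

    def __init__(self):
        self.class_counts = Counter()               # posts per class
        self.feature_counts = defaultdict(Counter)  # feature counts per class
        self.vocabulary = set()                     # all features seen

    def train(self, examples):
        """examples: iterable of (list_of_features, class_label) pairs."""
        for features, label in examples:
            self.class_counts[label] += 1
            self.feature_counts[label].update(features)
            self.vocabulary.update(features)

    def log_score(self, features, label):
        """log P(C) + sum_j log P*(f_j | C) with add-one smoothing."""
        total_posts = sum(self.class_counts.values())
        score = math.log(self.class_counts[label] / total_posts)
        denominator = (sum(self.feature_counts[label].values())
                       + len(self.vocabulary))
        for feature in features:
            count = self.feature_counts[label][feature]
            score += math.log((count + 1) / denominator)
        return score

    def classify(self, features):
        """Return the class with the highest log score."""
        return max(self.class_counts,
                   key=lambda label: self.log_score(features, label))
```

Summing log probabilities instead of multiplying probabilities avoids numerical underflow on longer posts; the prior term is what pulls decisions toward the larger dialog act classes.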

3.5.1  Data  Preprocessing 

In  processing  both  the  actual  and  cheap  data  structures,  we  converted  all  word  tokens  to  lower 
case  to  match  our  CPOS  dictionary.  The  parts  of  speech  applied  by  Forsyth  were  not  changed 
in  the  actual  data  structure.  In  the  cheap  data  structure,  we  replaced  the  actual  parts  of  speech 
with the cheap parts of speech found in the CPOS dictionary. For each post, start-of-post and end-of-post markers were added to preserve context in bigram and trigram classification tasks.

Figure 3.3: Number of Trigram Features by Dialog Act

Because  none  of  the  emoticons  were  contained  in  either  the  WSJ  or  Brown  corpora,  these  were 
initially  assigned  the  CPOS  tag  of  “UNK”  or  unknown.  In  addition  to  providing  results  with  no 
effort  to  recognize  emoticons,  we  augmented  the  CPOS  dictionary  to  recognize  emoticons  in 
order  to  compare  performance  with  the  added  context  provided  by  these  chat  features. 

Because emoticons were assigned the interjection (“UH”) POS by Forsyth, we compared the performance of our classifier with “UH” and other POS tags. In addition to marking emoticons with “UNK” (not found in the CPOS dictionary) and “UH,” we followed Forsyth’s recommendation and tested our classifier marking these features with the unique POS tag “EMO” [6]. We further divided the emoticons into two categories, those found in Appendix A and those composed of phrase abbreviations such as “lol.” We provide results of our experiments using all emoticon tagging schemes in Chapter 4.

For  our  experiments,  because  we  were  not  interested  in  identifying  individuals,  we  further 
masked  all  user  names  in  training  and  test  posts  with  a  unique  word.  Because  this  word  was 
not  found  in  the  CPOS  dictionary,  we  automatically  assigned  the  POS  tag  “NNP”  for  accurate 
performance  comparison. 




Though our effort was not focused on POS tagging accuracy, we noted that our CPOS tagging methodology produced an accuracy ranging from 68.16% to 71.36% depending on our selection of emoticon POS marks. Figure 3.4 provides an example of the difference introduced by the CPOS methodology.

Post with Actual POS tags: with/IN an/DT answer/NN like/IN that/DT .../: nope/UH ..../: lol/UH
Post with Cheap POS tags: with/IN an/DT answer/NN like/IN that/IN .../: nope/UH ..../UNK lol/EMO

Figure 3.4: Example Post Displaying Differences in POS Markings

Start- and end-of-post markings have been removed for clarity. Note that
the  actual  POS  tagged  post  includes  the  POS  tags  as  applied  by  Forsyth.  The  same  post,  with 
CPOS  tags  applied,  shows  that  “with,”  “an,”  “answer,”  and  “like”  are  most  often  used  in  the  Wall 
Street  Journal  and  Brown  corpora  with  the  same  tags  as  Forsyth  applied.  “That,”  however,  is 
most  frequently  tagged  as  “IN”  (Preposition/subordinating  conjunction)  in  the  WSJ  and  Brown 
corpora  and  is  marked  as  such  by  our  CPOS  dictionary.  In  fact,  “that”  is  POS  tagged  as  “IN” 
6,682  times  and  as  “DT”  4,373  times  in  WSJ  and  Brown.  The  string  “...”  is  recognized  by  the 
CPOS  dictionary,  however  when  it  includes  extra  characters,  it  is  not  and  is  given  the  “UNK” 
tag  as  can  be  seen  above.  Note  also  that  the  popular  emoticon  “lol”  (laugh  out  loud)  is  marked 
with  our  “EMO”  tag  as  specified  in  the  settings  used  in  this  particular  experiment. 

For  illustration,  actual  POS  bigrams  for  this  sample  post  would  produce: 

(IN,DT),  (DT,NN),  (NN,IN),  (IN,DT),  (DT,:),  (:,UH),  (UH,:),  (:,UH). 

Using  cheap  POS  with  no  emoticon  recognition  would  result  in: 

(IN,DT),  (DT,NN),  (NN,IN),  (IN,IN),  (IN,:),  (:,UH),  (UH,UNK),  (UNK, UNK). 

Augmenting  our  CPOS  dictionary  to  tag  emoticons  with  our  “EMO”  tag  gives: 

(IN,DT),  (DT,NN),  (NN,IN),  (IN, IN),  (IN,:),  (:,UH),  (UH,UNK),  (UNK, EMO). 
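
The bigram lists above can be reproduced from the tagged posts of Figure 3.4 with a short sketch. The helper names parse_tagged_post and ngrams are hypothetical and introduced only for illustration; as in the figure, start- and end-of-post markers are omitted.

```python
def parse_tagged_post(text):
    """Split a 'word/TAG word/TAG ...' string into (word, tag) pairs.

    The split is made on the last '/' so that tokens such as '.../:'
    keep their punctuation intact.
    """
    pairs = []
    for token in text.split():
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs


def ngrams(items, n=2):
    """Return all contiguous n-grams of a sequence as tuples."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]


actual = parse_tagged_post(
    "with/IN an/DT answer/NN like/IN that/DT .../: nope/UH ..../: lol/UH")
cheap = parse_tagged_post(
    "with/IN an/DT answer/NN like/IN that/IN .../: nope/UH ..../UNK lol/EMO")

actual_pos_bigrams = ngrams([tag for _, tag in actual])
cheap_pos_bigrams = ngrams([tag for _, tag in cheap])
# Bigrams of word/POS pairs use the (word, tag) tuples themselves
cheap_pair_bigrams = ngrams(cheap)
```

The same ngrams helper yields unigrams and trigrams by changing n, and it applies equally to tag sequences and to sequences of word/POS pairs.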

3.5.2  Random  Trials 

We conducted 50 random trials in which 10% of the chat posts were held out for testing. Confusion matrices for selected experiment runs are included in Appendix C.

Having  completed  the  discussion  of  our  technical  approach,  we  present  our  results  in  the  next 
chapter. 
CHAPTER 4:

RESULTS  AND  ANALYSIS 


4.1  Introduction 

In this chapter, we present the results of our experiments. Comparisons between the performance of the naive Bayes classifier with various settings and feature selections are provided. For additional comparison, consider that Forsyth achieved a top dialog act tagging accuracy of 83.2% using a time-consuming process that included 300 iterations by a neural network incorporating 24 features. These results were achieved after human-verified part of speech tags were applied. Due to limited time available for this work, we could not recreate Forsyth’s experiments over our training/testing splits. We noted that each of our experiment runs completed in an average of 27.5 seconds on a desktop machine equipped with an Intel Core i7 and 8 gigabytes of RAM.
Note  that  this  includes  loading  all  dictionary  and  chat  data,  training  and  testing  on  both  actual 
POS  tagged  posts  and  cheap  POS  tagged  posts. 

4.2  Results 

For  all  experiments,  we  considered  the  human-verified  dialog  act  tags  applied  by  Forsyth  to 
be  ground  truth.  The  results  provided  in  this  chapter  refer  to  the  performance  of  the  classifier 
using  these  tags  as  “actual”  results.  These  are  provided  for  comparison  with  the  four  emoticon 
tagging  schemes  below.  Note  that  the  actual  POS  results  do  not  change  between  experiment 
sets.  In  all  confusion  matrices  and  summaries,  the  results  derived  when  using  actual  POS  are 
provided  with  the  results  of  cheap  POS  application  for  easy  reference. 

Our results include performance metrics from naive Bayes classifiers using part of speech unigrams, bigrams, trigrams, word bigrams, and word/POS pair bigrams, all using Witten-Bell smoothing. We also provide LaPlace smoothed results for POS bigrams for comparison to Witten-Bell for these experiments.

We  considered  our  results  separately  according  to  the  tagging  scheme  applied  to  emoticons. 
Appendix  B  provides  some  insight  into  how  our  tags  grouped  features  differently.  Essentially, 
we  are  binning  words  by  their  maximum  likelihood  estimation  parts  of  speech. 

No other changes were made to the algorithm between these sets of results. We initially made no effort to recognize emoticon features, noting that none appeared in the cheap POS dictionary. In our first set of 50 experiments, these were automatically assigned the “UNK” part of speech tag.


4.2.1  Emoticons  Not  Recognized 

Making no effort to recognize emoticons results in our cheap POS tagging achieving an accuracy of 68.16%. Essentially, these features are counted with all other unrecognized words, a set
that  includes  misspelled  words,  unusual  use  of  punctuation  (e.g.  etc. 
Run Number:                                     1      2      3      4      5      6      7      8      9     10     11     12     13     14     15     16     17     18
Training Posts                               9581   9521   9526   9537   9477   9516   9468   9493   9517   9511   9525   9558   9501   9477   9562   9519   9512   9473
Test Posts                                    986   1046   1041   1030   1090   1051   1099   1074   1050   1056   1042   1009   1066   1090   1005   1048   1055   1094
MLE performance                             0.307  0.285  0.296  0.312  0.312  0.310  0.306  0.304  0.307  0.309  0.316  0.295  0.298  0.298  0.299  0.293  0.282  0.283
Actual POS Unigrams                         0.662  0.672  0.663  0.678  0.672  0.684  0.693  0.694  0.696  0.670  0.677  0.681  0.672  0.694  0.689  0.677  0.681  0.683
Cheap POS Unigrams                          0.657  0.657  0.628  0.652  0.661  0.656  0.669  0.673  0.652  0.667  0.649  0.654  0.646  0.641  0.662  0.656  0.651  0.654
LaPlace Actual POS 2-grams                  0.717  0.721  0.720  0.729  0.720  0.736  0.746  0.742  0.750  0.730  0.727  0.736  0.733  0.741  0.732  0.736  0.735  0.740
LaPlace Cheap POS 2-grams                   0.723  0.725  0.709  0.714  0.725  0.731  0.733  0.734  0.724  0.727  0.718  0.714  0.705  0.720  0.721  0.720  0.722  0.723
Actual POS Bigrams                          0.729  0.723  0.729  0.742  0.726  0.733  0.744  0.754  0.747  0.741  0.731  0.745  0.727  0.734  0.738  0.736  0.741  0.744
Cheap POS Bigrams                           0.729  0.734  0.719  0.724  0.734  0.740  0.748  0.746  0.740  0.743  0.725  0.719  0.720  0.729  0.731  0.730  0.727  0.723
Word/Actual-POS pair 2-grams + POS 2-grams  0.829  0.820  0.822  0.826  0.854  0.839  0.854  0.846  0.840  0.838  0.850  0.832  0.841  0.850  0.836  0.824  0.829  0.836
Word/Cheap-POS pair 2-grams + POS 2-grams   0.834  0.822  0.828  0.820  0.845  0.842  0.854  0.834  0.835  0.833  0.839  0.832  0.833  0.842  0.825  0.819  0.827  0.824
Actual POS Trigrams                         0.809  0.811  0.810  0.817  0.828  0.816  0.826  0.820  0.823  0.823  0.837  0.813  0.821  0.836  0.823  0.811  0.819  0.820
Cheap POS Trigrams                          0.815  0.805  0.803  0.807  0.828  0.826  0.832  0.807  0.811  0.819  0.829  0.811  0.818  0.823  0.807  0.806  0.811  0.814
word 2-grams                                0.822  0.823  0.810  0.822  0.849  0.821  0.850  0.823  0.835  0.823  0.840  0.826  0.832  0.837  0.821  0.819  0.824  0.821

Run  Number: 

19 

20 

21 

22 

23 

24 

25 

26 

27 

28 

29 

30 

31 

32 

33 

34 

35 

36 

Training  Posts 

9554 

9527 

9542 

9518 

9492 

9547 

9523 

9534 

9491 

9546 

9553 

9497 

9496 

9506 

9480 

9563 

9498 

9458 

Test  Posts 

1013 

1040 

1025 

1049 

1075 

1020 

1044 

1033 

1076 

1021 

1014 

1070 

1071 

1061 

1087 

1004 

1069 

1109 

MLE  performance 

0.316 

0.303 

0.286 

0.299 

0.311 

0.298 

0.291 

0.300 

0.299 

0.299 

0.296 

0.297] 

0.288 

0.293 

0.315 

0.282 

0.275 

0.298 

Actual  POS  Unigrams 

0.677 

0.680 

0.685 

0.684 

0.687 

0.690 

0.671 

0.684 

0.676 

0.679 

0.686 

0.676 

0.673 

0.647 

0.695 

0.67l| 

0.675 

0.682 

Cheap  POS  Unigrams 

0.640 

0.651 

0.654 

0.663 

0.649 

0.679 

0.664 

0.661 

0.648 

0.657 

0.665 

0.664 

0.653 

0.627 

0.669 

0.643 

0.659 

0.656 

LaPlace  Actual  POS  2-grams 

0.735 

0.721 

0.738 

0.735 

0.730 

0.745 

0.725 

0.733 

0.724 

0.728 

0.732 

0.727] 

0.725 

0.715 

0.739 

0.735 

0.728 

0.738 

LaPlace  Cheap  POS  2-grams 

0.710 

0.713 

0.722 

0.720 

0.716 

0.742 

0.711 

0.732 

0.714 

0.718 

0.730 

0.727] 

0.713 

0.704 

0.733 

0.713 

0.724 

0.710 

Actual  POS  Bigrams 

0.736 

0.729 

0.734 

0.741 

0.725 

0.750 

0.731 

0.724j 

0.741 

0.738 

0.735 

0.739 

0.727 

0.730 

0.741 

0.734 

0.728 

0.739 

Cheap  POS  Bigrams 

0.728 

0.725 

0.726 

0.741 

0.719 

0.749 

0.720 

0.735 

0.723 

0.728 

0.732 

0.745 

0.721 

0.716 

0.753 

0.725 

0.737 

0.717 

Word/Actual-POS  pair  2-grams  +  POS  2-grams 

0.831 

0.847 

0.837 

0.845 

0.832 

0.851 

0.827 

0.832 

0.840 

0.836 

0.845 

0.836 

0.826 

0.833 

0.847 

0.839 

0.837 

0.844 

Word/Cheap-POS  pair  2-grams  +  POS  2-grams 

0.819 

0.829 

0.837 

0.830 

0.832 

0.851 

0.832 

0.834 

0.837 

0.831 

0.852 

0.823 

0.813 

0.830 

0.835 

0.841 

0.845 

0.834 

Actual  POS  Trigrams 

0.819 

0.825 

0.825 

0.824 

0.815 

0.845 

0.812 

0.817 

0.819 

0.820 

0.831 

0.826 

0.796 

0.807 

0.834 

0.824 

0.816 

0.834 

Cheap  POS  Trigrams 

0.821 

0.811 

0.819 

0.817 

0.812 

0.841 

0.817 

0.820 

0.825 

0.817 

0.830 

0.817j 

0.798 

0.816 

0.833 

0.810 

0.819 

0.823 

word  2-grams 

0.810 

0.838 

0.828 

0.831 

0.816 

0.835 

0.823 

0.826 

0.828 

0.829 

0.835 

0.816 

0.810 

0.825 

0.833 

0.829 

0.833 

0.841 

Run  Number: 

37 

38 

39 

40 

41 

42 

43 

44 

45 

46 

47 

48 

49 

50 

Mean 

Max 

Min 

Training  Posts 

9513 

9560 

9528 

9484 

9548 

9477 

9436 

9471 

9495 

9577] 

9511 

9445 

9485 

9538 

9513.3 

9581 

9436 

Test  Posts 

1054 

1007 

1039 

1083 

1019 

1090 

1131 

1096 

1072 

990 

1056 

1122 

1082 

1029 

1053.7 

1131 

986 

MLE  performance 

0.309 

0.327 

0.278 

0.302 

0.295 

0.303 

0.304 

0.302 

0.311 

0.287 

0.307 

0.295 

0.291 

0.296 

0.2993 

0.327 

0.275 

Actual  POS  Unigrams 

0.680 

0.684 

0.679 

0.682 

0.654 

0.701 

0.691 

0.675’ 

0.683 

0.683 

0.676 

0.688 

0.658 

0.684 

0.6795 

0.701 

0.647 

Cheap  POS  Unigrams 

0.646 

0.652 

0.639 

0.646 

0.629 

0.671 

0.655 

0.637 

0.660 

0.651 

0.644 

0.651 

0.638 

0.669 

0.6534 

0.679 

0.627 

LaPlace  Actual  POS  2-grams 

0.733 

0.737 

0.726 

0733 

0.714 

0.760 

0.730 

0.727 

0.737 

0.713 

0.737 

0.744 

0.704 

0.733 

0.7315 

0.760 

0.704 

LaPlace  Cheap  POS  2-grams 

0.701 

0726 

07131 

0.704 

0.709 

0.734 

0.714 

0.702 

0.726 

0.697 

0.716 

0.711 

0.697 

0.726 

0.7183 

0.742 

0.697 

Actual  POS  Bigrams 

0.741 

0.749 

0.723 

0741 

0.736 

0.761 

0.744 

0.719 

0.735; 

0.728 

0.750 

0.740 

0.712 

0.733 

0.7359 

0.761 

0.712 

Cheap  POS  Bigrams 

0.715 

0.741 

0.719; 

0.712 

0.722 

0.749 

0.717 

0.706 

0.737 

0.697 

0.734 

0.719 

0.707 

0.716 

0.7279 

0.753 

0.697 

Word/Actual-POS  pair  2-grams  +  POS  2-grams 

0.832 

0.831 

0.838 

0.837 

0.799 

0.839 

0.848 

0.829 

0.838 

0.829 

0.820 

0.834 

0.821 

0.831 

0.8356 

0.854 

0.799 

Word/Cheap-POS  pair  2-grams  +  POS  2-grams 

0.832 

0.833 

0.842 

0.830 

0.809 

0.834 

0.828 

0.829 

0.830 

0.823 

0.816 

0.831 

0.816 

0.824 

0.8315 

0.854 

0.809 

Actual  POS  Trigrams 

0.807 

0.812 

0.823 

0.813 

0.795 

0.826 

0.831 

0.813 

0.828 

0.814 

0.808 

0.826 

0.806 

0.826 

0.8196 

0.845 

0.795 

Cheap  POS  Trigrams 

0.804 

0.815 

0.826 

0.806 

0.786 

0.818 

0.821 

0.800 

0.811 

0.798 

0.795 

0.812 

0.810 

0.809 

0.8146 

0.841 

0.786 

word  2-grams 

0.823 

0.831 

0.833 

0.834 

0.805 

0.825 

0.843 

0.813 

0.838 

0.818 

0.809 

0.822 

0.821 

0.819 

0.8263 

0.850 

0.805 

Figure  4.1 :  Summary  of  Results  with  Emoticons  Unrecognized 


The rows labeled “Word/Actual-POS pair 2-grams + POS 2-grams” and “Word/Cheap-POS pair 2-grams + POS 2-grams” in Figure 4.1 (emphasized in the original figure) show that our best results were achieved using equation 3.1. These rows represent using the sum of feature probabilities of word/POS pair bigrams and of POS bigrams in our naive Bayes classifier. Using actual POS tags, we achieved better overall accuracy than Forsyth. In fact, using cheap POS tags, which require no preprocessing time or effort, nearly equaled the previous work.
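
To make the combined feature set more concrete, the following is a minimal sketch of a naive Bayes classifier that scores a post using both word/POS pair bigrams and POS bigrams. It is an illustration under stated assumptions, not the thesis implementation: equation 3.1 is defined in Chapter 3, and here its "sum of feature probabilities" is assumed to mean adding the log-probability contributions of both feature sets, with add-one smoothing. The names `NaiveBayes`, `features` and `bigrams`, and the toy posts and labels, are hypothetical.

```python
import math
from collections import Counter, defaultdict


def bigrams(seq):
    """Return adjacent pairs from seq, padded with start and end markers."""
    padded = ["<s>"] + list(seq) + ["</s>"]
    return list(zip(padded, padded[1:]))


def features(tagged_post):
    """Build both feature sets from a post given as (word, pos) pairs."""
    pair_bigrams = [("pair", bg) for bg in bigrams(tagged_post)]
    pos_bigrams = [("pos", bg) for bg in bigrams([pos for _, pos in tagged_post])]
    return pair_bigrams + pos_bigrams


class NaiveBayes:
    def __init__(self, alpha=1.0):
        self.alpha = alpha                       # add-alpha smoothing (assumed)
        self.class_counts = Counter()
        self.feature_counts = defaultdict(Counter)
        self.vocab = set()

    def train(self, posts, labels):
        for post, label in zip(posts, labels):
            self.class_counts[label] += 1
            for f in features(post):
                self.feature_counts[label][f] += 1
                self.vocab.add(f)

    def classify(self, post):
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            # Log prior plus the summed log likelihoods of both feature sets.
            score = math.log(count / total)
            denom = (sum(self.feature_counts[label].values())
                     + self.alpha * (len(self.vocab) + 1))
            for f in features(post):
                score += math.log((self.feature_counts[label][f] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy data; the dialog act labels here are illustrative only.
posts = [
    [("hi", "UH"), ("all", "DT")],
    [("hello", "UH"), ("everyone", "NN")],
    [("are", "VBP"), ("you", "PRP"), ("there", "RB"), ("?", ".")],
    [("is", "VBZ"), ("anyone", "NN"), ("here", "RB"), ("?", ".")],
]
labels = ["Greet", "Greet", "ynQuestion", "ynQuestion"]

clf = NaiveBayes()
clf.train(posts, labels)
print(clf.classify([("hi", "UH"), ("everyone", "NN")]))
```

Swapping the POS tags in the input between hand-annotated and cheap tags is the only change needed to compare the two settings in a sketch like this.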

In  order  to  determine  if  we  would  achieve  better  dialog  act  classification  accuracy  with  different 
emoticon  tags,  we  attempted  three  new  tagging  schemes. 

4.2.2  Emoticons  Labeled  as  Interjections 

One  of  the  decisions  made  by  Forsyth  in  developing  the  NPS  chat  corpus  was  that  emoticons 
should  be  labeled  as  interjections  (“UH”).  We  used  regular  expressions  to  identify  both  types  of 
emoticons  and  augmented  our  cheap  POS  dictionary  to  also  label  them  as  interjections. 
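
As a rough illustration of this step, the sketch below detects both kinds of emoticons with regular expressions and returns a configurable tag for them, falling back to a dictionary lookup for other tokens. The patterns, the acronym list and the function names are assumptions made for the example; the actual expressions used in our experiments are not reproduced here.

```python
import re

# Hypothetical patterns: punctuation "smileys" such as :) ;-) =D, and a few
# acronym-style emoticons. Real chat data would need a broader inventory.
SMILEY_RE = re.compile(r"[:;=][-o^']?[)(\]\[DPpOo/\\|*3]+")
ACRONYM_RE = re.compile(r"(?:lol+|rofl|lmao)", re.IGNORECASE)


def emoticon_type(token):
    """Return 'smiley', 'acronym', or None for a single token."""
    if SMILEY_RE.fullmatch(token):
        return "smiley"
    if ACRONYM_RE.fullmatch(token):
        return "acronym"
    return None


def cheap_tag(token, cheap_pos, emoticon_tag="UH", unknown_tag="UNK"):
    """Tag one token: emoticons get emoticon_tag, known words get their
    dictionary tag, and everything else is lumped into unknown_tag."""
    if emoticon_type(token) is not None:
        return emoticon_tag
    return cheap_pos.get(token.lower(), unknown_tag)


cheap_pos = {"dog": "NN", "runs": "VBZ"}   # toy dictionary for illustration
for tok in ["dog", ":-D", "LOL", "qwzx"]:
    print(tok, cheap_tag(tok, cheap_pos))
```

Passing a different `emoticon_tag` changes the tagging scheme without touching the detection logic.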

Using this scheme, our MLE part of speech tagger achieved its highest level of accuracy, matching the truth POS tags only 71.36% of the time.
Run Number                                      1      2      3      4      5      6      7      8      9     10     11     12     13     14     15     16     17     18
Training Posts                               9581   9521   9526   9537   9477   9516   9468   9493   9517   9511   9525   9558   9501   9477   9562   9519   9512   9473
Test Posts                                    986   1046   1041   1030   1090   1051   1099   1074   1050   1056   1042   1009   1066   1090   1005   1048   1055   1094
MLE performance                             0.307  0.285  0.296  0.312  0.312  0.310  0.306  0.304  0.307  0.309  0.316  0.295  0.298  0.298  0.299  0.293  0.282  0.283
Actual POS Unigrams                         0.662  0.672  0.663  0.678  0.672  0.684  0.693  0.694  0.696  0.670  0.677  0.681  0.672  0.694  0.689  0.677  0.681  0.683
Cheap POS Unigrams                          0.631  0.638  0.605  0.648  0.650  0.634  0.660  0.661  0.645  0.643  0.628  0.644  0.630  0.649  0.646  0.642  0.636  0.654
Laplace Actual POS 2-grams                  0.717  0.721  0.720  0.729  0.720  0.736  0.746  0.742  0.750  0.730  0.727  0.736  0.733  0.741  0.732  0.736  0.735  0.740
Laplace Cheap POS 2-grams                   0.686  0.696  0.692  0.706  0.715  0.707  0.716  0.723  0.713  0.705  0.699  0.699  0.687  0.713  0.707  0.714  0.708  0.716
Actual POS Bigrams                          0.729  0.723  0.729  0.742  0.726  0.733  0.744  0.754  0.747  0.741  0.731  0.745  0.727  0.734  0.738  0.736  0.741  0.744
Cheap POS Bigrams                           0.704  0.705  0.701  0.710  0.717  0.716  0.723  0.723  0.728  0.717  0.696  0.704  0.700  0.720  0.720  0.720  0.713  0.718
Word/Actual-POS pair 2-grams + POS 2-grams  0.829  0.820  0.822  0.826  0.854  0.839  0.854  0.846  0.840  0.838  0.850  0.832  0.841  0.850  0.836  0.824  0.829  0.836
Word/Cheap-POS pair 2-grams + POS 2-grams   0.836  0.824  0.827  0.820  0.842  0.842  0.851  0.832  0.838  0.831  0.841  0.833  0.835  0.841  0.825  0.822  0.823  0.822
Actual POS Trigrams                         0.809  0.811  0.810  0.817  0.828  0.816  0.826  0.820  0.823  0.823  0.837  0.813  0.821  0.836  0.823  0.811  0.819  0.820
Cheap POS Trigrams                          0.813  0.804  0.804  0.808  0.828  0.820  0.830  0.806  0.809  0.815  0.834  0.809  0.819  0.824  0.811  0.806  0.811  0.812
word 2-grams                                0.822  0.823  0.810  0.822  0.849  0.821  0.850  0.823  0.835  0.823  0.840  0.826  0.832  0.837  0.821  0.819  0.824  0.821

Run Number                                     19     20     21     22     23     24     25     26     27     28     29     30     31     32     33     34     35     36
Training Posts                               9554   9527   9542   9518   9492   9547   9523   9534   9491   9546   9553   9497   9496   9506   9480   9563   9498   9458
Test Posts                                   1013   1040   1025   1049   1075   1020   1044   1033   1076   1021   1014   1070   1071   1061   1087   1004   1069   1109
MLE performance                             0.316  0.303  0.286  0.299  0.311  0.298  0.291  0.300  0.299  0.299  0.296  0.297  0.288  0.293  0.315  0.282  0.275  0.298
Actual POS Unigrams                         0.677  0.680  0.685  0.684  0.687  0.690  0.671  0.684  0.676  0.679  0.686  0.676  0.673  0.647  0.695  0.671  0.675  0.682
Cheap POS Unigrams                          0.635  0.644  0.642  0.651  0.647  0.666  0.650  0.644  0.630  0.631  0.652  0.644  0.643  0.618  0.655  0.633  0.646  0.650
Laplace Actual POS 2-grams                  0.735  0.721  0.738  0.735  0.730  0.745  0.725  0.733  0.724  0.728  0.732  0.727  0.725  0.715  0.739  0.735  0.728  0.738
Laplace Cheap POS 2-grams                   0.711  0.704  0.709  0.701  0.709  0.723  0.691  0.710  0.697  0.708  0.715  0.706  0.703  0.694  0.710  0.700  0.703  0.706
Actual POS Bigrams                          0.736  0.729  0.734  0.741  0.725  0.750  0.731  0.724  0.741  0.738  0.735  0.739  0.727  0.730  0.741  0.734  0.728  0.739
Cheap POS Bigrams                           0.723  0.715  0.703  0.715  0.709  0.730  0.700  0.715  0.710  0.716  0.715  0.723  0.710  0.700  0.729  0.706  0.717  0.707
Word/Actual-POS pair 2-grams + POS 2-grams  0.831  0.847  0.837  0.845  0.832  0.851  0.827  0.832  0.840  0.836  0.845  0.836  0.826  0.833  0.847  0.839  0.837  0.844
Word/Cheap-POS pair 2-grams + POS 2-grams   0.821  0.833  0.838  0.831  0.828  0.847  0.831  0.834  0.834  0.832  0.848  0.826  0.817  0.834  0.836  0.843  0.843  0.838
Actual POS Trigrams                         0.819  0.825  0.825  0.824  0.815  0.845  0.812  0.817  0.819  0.820  0.831  0.826  0.796  0.807  0.834  0.824  0.816  0.834
Cheap POS Trigrams                          0.816  0.809  0.822  0.820  0.814  0.842  0.815  0.821  0.823  0.820  0.832  0.821  0.797  0.812  0.834  0.818  0.819  0.819
word 2-grams                                0.810  0.838  0.828  0.831  0.816  0.835  0.823  0.826  0.828  0.829  0.835  0.816  0.810  0.825  0.833  0.829  0.833  0.841

Run Number                                     37     38     39     40     41     42     43     44     45     46     47     48     49     50    Mean    Max    Min
Training Posts                               9513   9560   9528   9484   9548   9477   9436   9471   9495   9577   9511   9445   9485   9538  9513.3   9581   9436
Test Posts                                   1054   1007   1039   1083   1019   1090   1131   1096   1072    990   1056   1122   1082   1029  1053.7   1131    986
MLE performance                             0.309  0.327  0.278  0.302  0.295  0.303  0.304  0.302  0.311  0.287  0.307  0.295  0.291  0.296  0.2993  0.327  0.275
Actual POS Unigrams                         0.680  0.684  0.679  0.682  0.654  0.701  0.691  0.675  0.683  0.683  0.676  0.688  0.658  0.684  0.6795  0.701  0.647
Cheap POS Unigrams                          0.641  0.646  0.625  0.644  0.621  0.659  0.637  0.625  0.652  0.630  0.653  0.634  0.619  0.654  0.6413  0.666  0.605
Laplace Actual POS 2-grams                  0.733  0.737  0.726  0.733  0.714  0.760  0.730  0.727  0.737  0.713  0.737  0.744  0.704  0.733  0.7315  0.760  0.704
Laplace Cheap POS 2-grams                   0.693  0.713  0.700  0.705  0.702  0.733  0.714  0.686  0.711  0.688  0.723  0.705  0.678  0.715  0.7053  0.733  0.678
Actual POS Bigrams                          0.741  0.749  0.723  0.741  0.736  0.761  0.744  0.719  0.735  0.728  0.750  0.740  0.712  0.733  0.7359  0.761  0.712
Cheap POS Bigrams                           0.700  0.726  0.706  0.707  0.711  0.740  0.717  0.694  0.719  0.693  0.736  0.716  0.692  0.708  0.7129  0.740  0.692
Word/Actual-POS pair 2-grams + POS 2-grams  0.832  0.831  0.838  0.837  0.799  0.839  0.848  0.829  0.838  0.829  0.820  0.834  0.821  0.831  0.8356  0.854  0.799
Word/Cheap-POS pair 2-grams + POS 2-grams   0.828  0.829  0.843  0.829  0.806  0.833  0.833  0.828  0.834  0.824  0.815  0.832  0.820  0.827  0.8316  0.851  0.806
Actual POS Trigrams                         0.807  0.812  0.823  0.813  0.795  0.826  0.831  0.813  0.828  0.814  0.808  0.826  0.806  0.826  0.8196  0.845  0.795
Cheap POS Trigrams                          0.805  0.815  0.827  0.808  0.785  0.814  0.824  0.805  0.814  0.802  0.795  0.810  0.813  0.808  0.8149  0.842  0.785
word 2-grams                                0.823  0.831  0.833  0.834  0.805  0.825  0.843  0.813  0.838  0.818  0.809  0.822  0.821  0.819  0.8263  0.850  0.805

Figure 4.2: Summary of Results with Emoticons Tagged as Interjection


Figure 4.2 shows that, again, using the combination of feature vectors described in equation 3.1 provided the highest average accuracy. Forsyth’s decision to tag emoticons as interjections performs better than grouping them into the cheap “UNK” category.

We  then  explored  the  use  of  a  unique  tag  for  emoticons.  We  used  regular  expressions  to  identify 
common  emoticons  and  augmented  our  dictionary  to  tag  recognized  emoticons  with  “EMO.” 

4.2.3  Two  Types  of  Emoticons  as  One  Part  of  Speech 

We hypothesized that emoticons may deserve their own part of speech tag and, if so, that our dialog act classification accuracy may improve with this added information. To this point, we have seen that putting all unrecognized words into one category provides less accuracy than identifying emoticons as interjections. We decided to give emoticons a unique cheap POS tag and elected to tag them with “EMO.”
Run Number                                      1      2      3      4      5      6      7      8      9     10     11     12     13     14     15     16     17     18
Training Posts                               9581   9521   9526   9537   9477   9516   9468   9493   9517   9511   9525   9558   9501   9477   9562   9519   9512   9473
Test Posts                                    986   1046   1041   1030   1090   1051   1099   1074   1050   1056   1042   1009   1066   1090   1005   1048   1055   1094
MLE performance                             0.307  0.285  0.296  0.312  0.312  0.310  0.306  0.304  0.307  0.309  0.316  0.295  0.298  0.298  0.299  0.293  0.282  0.283
Actual POS Unigrams                         0.662  0.672  0.663  0.678  0.672  0.684  0.693  0.694  0.696  0.670  0.677  0.681  0.672  0.694  0.689  0.677  0.681  0.683
Cheap POS Unigrams                          0.656  0.665  0.630  0.661  0.664  0.659  0.673  0.679  0.669  0.675  0.655  0.663  0.659  0.661  0.669  0.662  0.658  0.664
Laplace Actual POS 2-grams                  0.717  0.721  0.720  0.729  0.720  0.736  0.746  0.742  0.750  0.730  0.727  0.736  0.733  0.741  0.732  0.736  0.735  0.740
Laplace Cheap POS 2-grams                   0.720  0.728  0.718  0.721  0.730  0.735  0.736  0.740  0.733  0.738  0.721  0.717  0.712  0.728  0.727  0.729  0.728  0.731
Actual POS Bigrams                          0.729  0.723  0.729  0.742  0.726  0.733  0.744  0.754  0.747  0.741  0.731  0.745  0.727  0.734  0.738  0.736  0.741  0.744
Cheap POS Bigrams                           0.732  0.736  0.726  0.729  0.731  0.742  0.747  0.747  0.749  0.751  0.726  0.721  0.727  0.735  0.735  0.740  0.734  0.735
Word/Actual-POS pair 2-grams + POS 2-grams  0.829  0.820  0.822  0.826  0.854  0.839  0.854  0.846  0.840  0.838  0.850  0.832  0.841  0.850  0.836  0.824  0.829  0.836
Word/Cheap-POS pair 2-grams + POS 2-grams   0.831  0.825  0.831  0.819  0.846  0.846  0.853  0.834  0.839  0.832  0.841  0.834  0.834  0.839  0.822  0.823  0.826  0.824
Actual POS Trigrams                         0.809  0.811  0.810  0.817  0.828  0.816  0.826  0.820  0.823  0.823  0.837  0.813  0.821  0.836  0.823  0.811  0.819  0.820
Cheap POS Trigrams                          0.815  0.802  0.802  0.811  0.831  0.825  0.835  0.809  0.809  0.816  0.827  0.814  0.823  0.823  0.809  0.807  0.816  0.814
word 2-grams                                0.822  0.823  0.810  0.822  0.849  0.821  0.850  0.823  0.835  0.823  0.840  0.826  0.832  0.837  0.821  0.819  0.824  0.821

Run Number                                     19     20     21     22     23     24     25     26     27     28     29     30     31     32     33     34     35     36
Training Posts                               9554   9527   9542   9518   9492   9547   9523   9534   9491   9546   9553   9497   9496   9506   9480   9563   9498   9458
Test Posts                                   1013   1040   1025   1049   1075   1020   1044   1033   1076   1021   1014   1070   1071   1061   1087   1004   1069   1109
MLE performance                             0.316  0.303  0.286  0.299  0.311  0.298  0.291  0.300  0.299  0.299  0.296  0.297  0.288  0.293  0.315

Figure 4.3: Summary of Results with Emoticons Tagged as “EMO”


As can be seen in Figure 4.3, we improved the accuracy of our classifier slightly. This suggests that emoticons may serve better as this new part of speech rather than as interjections. Examining the emoticons in use today, there appear to be two distinct types: those that are made of combinations of punctuation (“smileys”) and those that are acronyms like “lol” for “laugh[ing] out loud.”

4.2.4  Two  Types  of  Emoticon  Tags 

In order to determine if our dialog act classifier performance would improve if we recognized the different types of emoticons as two different parts of speech, we augmented the POS dictionary as such. Emoticons based on acronyms were assigned the part of speech tag of “EMO2.”

Figure 4.4: Summary of Results with Emoticons Separated into Two Groups


Figure 4.4 shows a similar performance when we segregate the two emoticon types. Separating the emoticons into two groups based on type actually increased our classifier’s performance by 0.003%. We suspect that this may indicate that the different types serve different syntactic purposes. Further analysis of this phenomenon was not completed due to time constraints.

4.3  Analysis 

We have demonstrated that using equation 3.1 provided the best accuracy for our cheap POS method and that our method equals or improves accuracy, depending on which tags are applied to emoticons, as compared to Forsyth [6].

We note that classification based on word bigrams gives an overall accuracy of 82.63%, actual POS bigrams result in 73.59% and actual POS trigrams 81.96%. This suggests that sentence structure rather than content carries the dialog act signal. Cheap POS bigrams achieve an accuracy of 73.47% when all emoticons are given a common tag. Cheap POS trigrams with this tagging scheme result in an overall accuracy of 81.57%, only 0.39 percentage points less than actual POS trigrams. Our cheap POS trigrams carry the dialog act signal virtually as well as actual POS trigrams.

Appendix C contains tables showing the effects of our various POS tagging schemes. We provide the counts of each POS by dialog act type. Figure B.1 contains the counts of these tags as applied by Forsyth. Figures B.2, B.3, B.4 and B.5 show the cheap POS counts as applied by our methodology. We note the shifts in “UNK,” “UH,” “EMO” and “EMO2” counts according to tagging scheme as each figure’s caption indicates. These experiments serve as a preliminary exploration of Harris’ “Distributional Hypothesis” [25].

In order to demonstrate statistical significance in our experiments, we compared the performance of word bigrams, cheap POS and actual POS. For this task, we chose the Wilcoxon Signed-Rank Pair test and selected a confidence level of 99%.
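
For readers unfamiliar with the test, the sketch below shows one way to compute a two-sided Wilcoxon signed-rank p value for paired per-run accuracies. It is a simplified illustration, not the procedure used to produce the p values reported in this chapter: it relies on the normal approximation without tie or continuity corrections, and the function name `wilcoxon_signed_rank` is hypothetical. The sample values are the first ten runs of the “word 2-grams” and “Word/Cheap-POS pair 2-grams + POS 2-grams” rows of Figure 4.1, used only as example input.

```python
import math


def wilcoxon_signed_rank(xs, ys, decimals=6):
    """Return (W, two-sided p) for paired samples via the normal approximation."""
    # Paired differences, rounded to avoid floating-point noise in tie
    # detection; zero differences are discarded, as is conventional.
    diffs = [round(x - y, decimals) for x, y in zip(xs, ys)]
    diffs = [d for d in diffs if d != 0]
    n = len(diffs)
    if n == 0:
        return 0.0, 1.0

    # Rank the absolute differences, giving tied values their average rank.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    start = 0
    while start < n:
        end = start
        while end + 1 < n and abs(diffs[order[end + 1]]) == abs(diffs[order[start]]):
            end += 1
        average_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = average_rank
        start = end + 1

    # W is the smaller of the positive and negative rank sums.
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    w = min(w_plus, w_minus)

    # Under the null hypothesis W is approximately normal with this mean and sd.
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w - mean) / sd
    p = math.erfc(abs(z) / math.sqrt(2))
    return w, p


word_bigrams = [0.822, 0.823, 0.810, 0.822, 0.849, 0.821, 0.850, 0.823, 0.835, 0.823]
cheap_combined = [0.834, 0.822, 0.828, 0.820, 0.845, 0.842, 0.854, 0.834, 0.835, 0.833]
w, p = wilcoxon_signed_rank(word_bigrams, cheap_combined)
print(f"W = {w}, p = {p:.4f}, significant at 99%: {p < 0.01}")
```

With only ten runs the normal approximation is coarse; an exact test or a statistics library would be preferable for reporting.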

4.3.1  Statistical  Significance  with  Emoticons  Unrecognized 

Figure  4.5  displays  the  distributions  of  overall  accuracies  when  emoticons  are  not  recognized 
and  are  therefore  tagged  as  “UNK.”  We  can  see  overlap  in  the  performance  of  our  classifier 
using  our  selected  feature  sets. 

We applied the Wilcoxon Signed-Rank Pair test between word bigrams and cheap POS results using equation 3.1 with a resulting p value of 0.0000181. There is only a remote possibility that this is a result of random chance.

Figure 4.5: Bar Plot of Accuracies with Emoticons Unrecognized

Applying the same test between the cheap and actual POS results also exceeded our 99% confidence level with a p value of 0.00025. We conclude that we have strong statistical significance in our method’s performance.


4.3.2  Statistical  Significance  with  Emoticons  Tagged  as  Interjections 

Figure 4.6 shows the distributions of overall accuracies when we concur with Forsyth’s decision to tag emoticons as interjections. The results for word bigrams and for features using actual POS tags show no change from the previous figure and are provided for easy reference. We see the general improvement in cheap POS feature performance as a slight upward trend.

We applied the Signed-Rank Pair test between word bigram and cheap POS performance and computed a p value of 0.0000051. We conclude statistical significance in our method.

We  also  applied  the  test  between  cheap  POS  and  actual  POS  data  with  a  resulting  p  value  of 
0.00017. 

Figure 4.6: Bar Plot of Accuracies with Emoticons Tagged as Interjections

Figure 4.7: Bar Plot of Accuracies with All Emoticons Tagged as “EMO”

4.3.3 Statistical Significance with Emoticons Tagged with “EMO”

We find in Figure 4.7 that marking emoticons with a single, unique tag gives better results than using the interjection tag. In fact, this emoticon tagging scheme produces better average accuracy than the previous work.

We  continued  to  use  the  Wilcoxon  Signed-Rank  Pair  test  with  a  confidence  level  of  99%.  When 
we  compared  word  bigrams  with  the  performance  of  cheap  POS,  our  p  value  was  0.0000013. 
The  test  resulted  in  a  p  value  of  0.00191  when  comparing  cheap  POS  performance  to  actual 
POS  performance. 

We  continue  to  demonstrate  statistical  significance. 

4.3.4  Statistical  Significance  with  Emoticons  Tagged  as  Two  Types 


Figure 4.8: Bar Plot of Accuracies with Emoticons Tagged as “EMO” and “EMO2”

Figure 4.8 represents our final emoticon tagging scheme and shows a very slight increase in average accuracy over the previous scheme. This average also matches Forsyth’s best. Significance testing, conducted as in previous experiments, gives a p value of 0.000019 in comparison of word bigrams and cheap POS. We achieved a p value of 0.00061 when testing cheap and actual POS performances.

4.3.5 Dialog Act or Authorship Identification?

In order to determine that our classifier results were not influenced by the characteristics of prolific chat participants, we used Forsyth’s original data to map masked user names to their screen names. We were able to attribute 9,856 posts to 1,122 individuals. We split the correlated data using 90% of the identified authors for training and the other 10% for testing. No posts from the tested authors were included in the training set. We performed testing over 10 such splits with the overall accuracies provided in Table 4.1.

Run Number                                     Mean     Max     Min
Training Posts                               8828.8    9277    8653
Test Posts                                   1027.2    1203     579
MLE performance                              0.2988  0.3310  0.2418
Actual POS Unigrams                          0.6920  0.7513  0.6540
Cheap POS Unigrams                           0.6802  0.7427  0.6238
Laplace Actual POS 2-grams                   0.7406  0.7910  0.7037
Laplace Cheap POS 2-grams                    0.7357  0.7807  0.6852
Actual POS Bigrams                           0.7395  0.7997  0.7027
Cheap POS Bigrams                            0.7410  0.7841  0.6988
Actual word/POS 2-grams + POS 2-grams        0.8350  0.8722  0.8051
Cheap POS word/POS 2-grams + POS 2-grams     0.8337  0.8756  0.7973
Actual POS Trigrams                          0.8160  0.8411  0.7836
Cheap POS Trigrams                           0.8131  0.8549  0.7700
word 2-grams                                 0.8231  0.8549  0.7856

Table 4.1: Average Dialog Act Tagging Accuracies Leaving 10% of Authors Out

Note that the number of posts used for testing varies significantly, with a maximum of 1,203 and a minimum of 579. This is due to the wide variation in individual user contributions. Figure 4.9 provides a histogram of the number of authors with post count bins on the x-axis. Note that 913 (81.4%) of the identifiable authors produced 10 or fewer posts, while the most prolific author provided more than 130 posts. Splitting the data set by author and the disparity in levels of author participation are responsible for our test population variance. Table 4.1 shows that it is unlikely that our dialog act classification method is influenced by author characteristics.
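
The author-disjoint split described above can be sketched as follows. This is a minimal illustration that assumes posts are available as (author, text) pairs; the function name `split_by_author`, the seed handling and the toy data are hypothetical and are not taken from our experimental code.

```python
import random


def split_by_author(posts, test_fraction=0.10, seed=0):
    """Split (author, text) posts so that no author appears in both sets.

    Authors, not posts, are sampled: roughly test_fraction of the distinct
    authors are held out, and every post by a held-out author goes to the
    test set. The number of test posts therefore varies from split to split.
    """
    authors = sorted({author for author, _ in posts})
    rng = random.Random(seed)
    rng.shuffle(authors)
    n_test = max(1, round(len(authors) * test_fraction))
    test_authors = set(authors[:n_test])

    train = [p for p in posts if p[0] not in test_authors]
    test = [p for p in posts if p[0] in test_authors]
    return train, test


# Toy data: 20 hypothetical authors with 5 posts each.
posts = [(f"user{i % 20}", f"post {i}") for i in range(100)]
train, test = split_by_author(posts, seed=1)
print(len(train), len(test))
```

Repeating the call with different seeds yields the multiple splits over which accuracies can be averaged.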

We have demonstrated a technique that provides improved dialog act tagging accuracy in the chat domain. We have also shown statistical significance in our method’s performance and that our results are not skewed by author characteristics. While prior work in this domain has relied on time-consuming, human-verified part of speech tagging, our method demonstrates that this investment is not required for effective dialog act tagging in the chat domain.

Figure 4.9: Histogram of Author Post Counts

With  our  presentation  of  experiment  results  and  analysis  complete,  we  provide  our  conclusions 
and  recommendations  for  future  work  in  Chapter  5. 

CHAPTER 5:

CONCLUSIONS  AND  FUTURE  WORK 


5.1  Conclusions 

Part  of  speech  tagging  is  useful  in  dialog  act  tagging  as  shown  in  Forsyth  [6]  and  Wu  et  al. 
[8].  Unfortunately,  at  the  current  state-of-the-art,  accurate  grammatical  tagging  requires  hand- 
annotation  in  the  chat  domain.  We  hypothesize  that  by  using  cross-domain  MLE  part  of  speech 
tags,  similar  dialog  act  tagging  performance  is  achievable  with  significantly  less  effort  vis-a-vis 
hand  POS  tagging. 

The methodology presented in Chapter 3 performs virtually as well as using actual, hand-tagged part of speech tags without the preprocessing time and effort. Our experiments show that for the chat domain, accurate POS tags are not required to effectively determine chat post dialog act tags. Though our results show a minimal decrease in overall accuracy when compared to the same experiments using actual parts of speech, we required neither preprocessing nor hand-tagging of parts of speech. Further, cheap POS tagging is extremely fast. We also showed, through statistical significance testing, that our method’s performance, with high probability, is not the result of chance.

While using actual POS tags performed only 0.3% better than cheap POS tags for accurate dialog act determination, our method required only the processing time needed to load our POS dictionary and apply these tags.
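
The dictionary load-and-apply step can be illustrated with a short sketch of a most-frequent-tag (MLE) tagger. It assumes the dictionary is built from (word, tag) pairs drawn from an already tagged corpus in another genre and that unseen tokens receive the “UNK” tag; the function names and the toy corpus are hypothetical.

```python
from collections import Counter, defaultdict


def build_mle_dictionary(tagged_corpus):
    """Map each word to the tag it carries most often in the tagged corpus."""
    counts = defaultdict(Counter)
    for word, tag in tagged_corpus:
        counts[word.lower()][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def cheap_pos_tag(tokens, dictionary, unknown_tag="UNK"):
    """Apply dictionary tags with a single lookup per token, no context used."""
    return [(tok, dictionary.get(tok.lower(), unknown_tag)) for tok in tokens]


# Toy out-of-genre corpus; "run" is seen more often as a verb than as a noun.
corpus = [("the", "DT"), ("run", "NN"), ("run", "VB"), ("run", "VB"), ("dog", "NN")]
dictionary = build_mle_dictionary(corpus)
print(cheap_pos_tag(["The", "dog", "lol"], dictionary))
```

Because tagging is a constant-time lookup per token, the cost is dominated by loading the dictionary once.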

5.1.1  Uses  for  Dialog  Acts 

Stolcke suggests that consensus is building in the Natural Language Processing community that dialog act tags are useful for higher-order linguistic analysis [12]. Dialog act tags have been used in multi-party meeting summarization (Yang et al. [26]) and spoken dialog systems (Walker and Passonneau [27]). Spoken dialog systems can use dialog act tags to improve response accuracy.

5.1.2  Implications  for  Tactical  Military  Chat 

Eovito’s thesis provides a list of functional requirements for tactical military chat. We believe some of these requirements may benefit from automatically determined dialog act information. For example, Eovito’s core requirements include Thread Population/Repopulation. This function is designed to provide new or returning users a recapitulation of recent tactical chat events [2]. Rather than present these participants with a temporally indiscriminate list of messages, we believe that dialog acts could be used to filter the information provided to include dialog acts of interest. For example, the system could be configured to display recent questions and their corresponding answers. Direction from higher authority in the form of statements could also be highlighted, thus filtering unnecessary noise and providing improved situational awareness with less effort required by the user.

An additional requirement identified by Eovito is Chat Logging, or preserving chat data for historical record [2]. While this may simply involve saving files, we believe that post-processing tasks would benefit from our methodology. By automatically identifying dialog acts and using these new features, we could separate the inherently interleaved conversations, thus automatically providing a summary of who said what to whom and when. We believe this information could then be used to generate lessons learned for individuals, units and operational planners.

While our method will require additional functionality to achieve these goals, we believe
that we provide an enabling foundation for further development.

5.2  Contributions 

Our experiments expand the field through the following contributions:

•  We  developed  a  cross-genre  POS  tagging  methodology.  This  pushes  the  field  forward  in 
that it was previously known that MLE within genre works well; our contribution shows
that  MLE  cross-genre  is  effective  in  the  chat  domain.  We  refer  to  this  as  “Cheap”  POS 
(or  CPOS)  tagging.  This  opens  the  door  for  more  research  in  domains  where  there  is  little 
labeled  data. 

•  We  further  validated  the  benefits  of  CPOS  tagging  by  comparing  it  against  hand-tagged 
POS  for  dialog  act  prediction.  Our  research  shows  that  the  extra  work  required  for  hand 
labeling  is  unnecessary.  Simply  using  pre-existing  labeled  data  from  other  genres  is  as 
effective  without  the  time  and  cost  investment. 

• We empirically verified Harris’ “Distributional Hypothesis” as applied to emoticons. When
we treat emoticons as distinct parts of speech, with their own n-gram distributions, our
results improve.

• We accomplished significant feature engineering to discover effective combinations of
features for dialog act tagging. Further research is needed, but we believe these features
will be useful for downstream analysis.
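
The cross-genre tagging idea in the first contribution can be illustrated with a short sketch. It assumes that MLE tagging means assigning each word the tag it most frequently carries in a labeled corpus from another genre, that unseen words receive “UNK”, and that emoticons found in a dictionary receive a dedicated tag. The function names, the toy training data, and the one-entry emoticon set are all hypothetical.

```python
from collections import Counter, defaultdict


def train_mle_tagger(tagged_sentences):
    """Map each lowercased word to its most frequent tag in the training data."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def cheap_tag(tokens, model, emoticons=frozenset(), emoticon_tag="EMO", unknown_tag="UNK"):
    """Tag chat tokens with a model trained on a different genre."""
    tagged = []
    for token in tokens:
        if token in emoticons:
            tagged.append((token, emoticon_tag))      # emoticon dictionary lookup
        else:
            tagged.append((token, model.get(token.lower(), unknown_tag)))
    return tagged


# Toy "out-of-genre" training data (invented for illustration).
news = [
    [("the", "DT"), ("run", "NN"), ("ended", "VBD")],
    [("they", "PRP"), ("run", "VBP"), ("daily", "RB")],
    [("we", "PRP"), ("run", "VBP")],
]
model = train_mle_tagger(news)
tagged = cheap_tag(["we", "run", ":)", "lol"], model, emoticons={":)"})
print(tagged)
```

Here “run” is tagged VBP because that tag is more frequent than NN in the toy training data, and “lol” falls back to “UNK” because it never appears there.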


5.3  Future  Work 

While  we  have  provided  useful  results,  we  recommend  the  following  research  with  the  goal  of 
improving  on  this  foundation: 

• For most machine learning techniques, more training data is generally desired. We
recommend continuing Forsyth’s work in privacy masking and tagging more of the chat corpus.
Results  for  this  and  other  methods  would  benefit  from  expanded  training  data.  Our  work 
should  prove  useful  in  expanding  the  size  of  the  NPS  chat  corpus  after  anonymizing  a 
larger  portion  of  the  raw  data  collected  by  Lin. 

•  Traditional  Naive  Bayes  classifiers  use  the  formula 

$$\hat{C} = \operatorname*{argmax}_{C \in \mathit{Classes}} \Big[ \log P(C) + \sum_{i} \log P(f_i \mid C) \Big].$$

In  the  course  of  our  experimentation,  we  noted  that  our  classifier  determined  the  correct 
class  within  the  top  two  results  over  89%  of  the  time  using  the  formula 

$$\hat{C} = \operatorname*{argmax}_{C \in \mathit{Classes}} \Big[ \log P(C) + \sum_{i} \log P(wpp_i \mid C) + \sum_{j} \log P(pb_j \mid C) \Big]$$

where $wpp_i$ is word/POS pair $i$ and $pb_j$ is POS bigram $j$. Our recommendation is an
exploration  of  cascading  naive  Bayes  results  to  another  classifier  in  order  to  improve 
dialog  act  tagging  accuracy.  As  noted  in  Chapter  4,  our  results  decreased  slightly  when 
we  segregated  emoticons  that  differed  in  form  by  tagging  them  with  different  parts  of 
speech.  Additional  training  data  is  needed  to  determine  if  this  decrease  in  performance 
is due to these features’ similarity or if segregating them reduced our classifier’s ability to
overcome  the  widely  disparate  dialog  act  class  prior  probabilities. 

•  Per  Forsyth’s  recommendations,  we  showed  that  emoticons  may  be  better  tagged  with 
a  POS  tag  different  from  “UH”  [6].  An  exploration  of  this  phenomenon  should  include 
other potential tagging schemes for these features.

• The use of this method of dialog act determination should be explored in the tactical
military  chat  domain.  Additional  effort  should  be  directed  to  thread  extraction  in  this 
critical  command  and  control  subsystem  to  provide  the  functionality  proposed  above. 

•  We  also  believe  that  our  method  of  dialog  act  tagging  chat  posts  would  translate  to  similar 
results on Short Message Service (SMS) data. The popularity of this form of
computer-mediated communication continues to grow. A corpus of privacy-masked text messages
should be constructed for analysis.

•  We  initially  hypothesized  that  cheap  POS  tags  could  be  useful  in  authorship  identification. 
While  we  performed  no  work  to  validate  this  theory,  we  believe  that  it  should  be  explored. 
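
The naive Bayes scoring in the second recommendation above, and the idea of keeping the top two classes for a later stage, can be sketched as follows. This is an illustration under stated assumptions, not the classifier used in our experiments: the add-alpha smoothing, the pooling of word/POS pairs and POS bigrams into a single feature vocabulary, and all names (`NaiveBayesSketch`, `extract_features`, `top_k`) are hypothetical.

```python
import math
from collections import Counter, defaultdict


def extract_features(tagged_post):
    """Word/POS pairs plus POS bigrams, kept apart by a namespace prefix."""
    pairs = [("wpp", word.lower(), tag) for word, tag in tagged_post]
    tags = [tag for _, tag in tagged_post]
    bigrams = [("pb", a, b) for a, b in zip(tags, tags[1:])]
    return pairs + bigrams


class NaiveBayesSketch:
    def __init__(self, alpha=1.0):
        self.alpha = alpha                      # smoothing constant (assumption)
        self.class_counts = Counter()
        self.feature_counts = defaultdict(Counter)
        self.vocab = set()

    def fit(self, examples):
        """examples: iterable of (feature_list, class_label) pairs."""
        for feats, label in examples:
            self.class_counts[label] += 1
            for f in feats:
                self.feature_counts[label][f] += 1
                self.vocab.add(f)
        return self

    def scores(self, feats):
        """log P(C) + sum of log P(feature | C) for every class C."""
        total = sum(self.class_counts.values())
        v = len(self.vocab) + 1                 # one extra slot for unseen features
        result = {}
        for c, n in self.class_counts.items():
            denom = sum(self.feature_counts[c].values()) + self.alpha * v
            score = math.log(n / total)
            for f in feats:
                score += math.log((self.feature_counts[c][f] + self.alpha) / denom)
            result[c] = score
        return result

    def top_k(self, feats, k=2):
        """Classes ranked by score, best first; candidates for a second classifier."""
        s = self.scores(feats)
        return sorted(s, key=s.get, reverse=True)[:k]


# Toy training posts (invented for illustration).
train = [
    ([("hi", "UH"), ("all", "DT")], "Greet"),
    ([("hello", "UH"), ("everyone", "NN")], "Greet"),
    ([("are", "VBP"), ("you", "PRP"), ("there", "RB"), ("?", ".")], "ynQuestion"),
    ([("is", "VBZ"), ("it", "PRP"), ("late", "JJ"), ("?", ".")], "ynQuestion"),
    ([("i", "PRP"), ("am", "VBP"), ("here", "RB")], "Statement"),
]
clf = NaiveBayesSketch().fit((extract_features(p), y) for p, y in train)
query = [("are", "VBP"), ("you", "PRP"), ("late", "JJ"), ("?", ".")]
top2 = clf.top_k(extract_features(query))
print(top2)
```

In a cascaded design, the two classes returned by `top_k` would be handed to a second classifier that only has to choose between them.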


5.4  Final  Conclusion 

We  present  a  methodology  that  capitalizes  on  previous,  human-involved  POS  tagging  efforts 
to  effectively  determine  dialog  acts  in  the  chat  domain.  We  hypothesize  that  methods  similar 
to  ours  are  useful  for  analysis  of  emerging  domains.  This  research  is  an  initial  foray  into  the 
cross-genre  POS  domain  providing  a  foundation  to  improve  methods  in  other  areas  of  interest 
for natural language processing.

APPENDIX A:
EMOTICON DICTIONARY

Table A.1: Partial Emoticon Dictionary from Wikipedia

APPENDIX B:

EFFECTS  OF  CHEAP  POS  METHOD 


This  appendix  contains  tables  showing  the  redistribution  of  POS  tags  by  our  different  emoticon 
tagging  schemes. 
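
The redistribution of tag counts under the single-tag emoticon schemes can be mimicked with a few lines of code. This is only a sketch: the three-entry emoticon set stands in for the dictionary of Appendix A, the tagged tokens are invented, and the two-group scheme is omitted because its grouping rule is not restated in this appendix.

```python
from collections import Counter

# Tiny illustrative stand-in for the emoticon dictionary.
EMOTICONS = {":)", ":(", ";)"}


def retag(tagged_tokens, emoticon_tag):
    """Relabel every emoticon with `emoticon_tag` ('UNK', 'UH', or 'EMO')."""
    return [
        (token, emoticon_tag if token in EMOTICONS else tag)
        for token, tag in tagged_tokens
    ]


def tag_counts(tagged_tokens):
    """Count how often each POS tag occurs."""
    return Counter(tag for _, tag in tagged_tokens)


# Invented tagged tokens; "xyzzy" plays the role of a genuinely unknown word.
posts = [
    ("hi", "UH"), (":)", "UH"), ("i", "PRP"),
    ("agree", "VBP"), (";)", "UH"), ("xyzzy", "UNK"),
]
counts = {scheme: tag_counts(retag(posts, scheme)) for scheme in ("UNK", "UH", "EMO")}
for scheme, c in counts.items():
    print(scheme, dict(c))
```

Moving the two emoticons between schemes shifts exactly two counts among the UNK, UH, and EMO categories while every other tag count stays fixed, which is the pattern the figures in this appendix display at corpus scale.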

Figure B.1 shows the distribution of POS tag counts across all dialog act classes as tagged by
Forsyth  [6]  and  serves  as  a  baseline  for  comparison. 


Figure B.1: Actual POS Counts


Figure  B.2  shows  the  distribution  of  POS  tag  counts  when  emoticons  are  unrecognized  and  are 
tagged  as  “UNK”.  We  can  observe  the  changes  in  POS  tag  counts  resulting  from  our  cheap  POS 
methodology.  Specifically,  note  the  increased  size  of  the  UNK  category.  Shifts  in  the  noun  and 
verb  categories  are  also  evident  as  a  result  of  our  maximum  likelihood  estimation  approach  to 
POS tagging.

Figure B.2: POS Counts with Emoticons Unrecognized


Figure B.3 shows the changes in the “UH” category as emoticons were moved out of the “UNK”
POS counts.

Figure B.3: POS Counts with Emoticons Tagged as Interjections


In  Figure  B.4,  we  have  tagged  all  emoticons  with  the  unique  “EMO”  tag.  Note  the  changes 
from the “UH” category to the “EMO” column.

Figure B.4: POS Counts with Emoticons Tagged with our EMO Tag


Finally,  Figure  B.5  displays  the  changes  in  POS  tag  counts  when  emoticons  are  separated  into 
two groups.

Figure B.5: POS Counts with Emoticons Separated into Two Groups


APPENDIX  C: 
CONFUSION  MATRICES 


This  appendix  contains  confusion  matrices  for  selected  experiments.  These  are  separated  by 
specific  emoticon  tagging  schemes  and  experiment  numbers  as  found  in  the  caption  of  each 
table. 
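
For readers who want to recompute the per-class figures, the sketch below derives precision, recall, F-score, and overall accuracy from a confusion matrix. It assumes rows hold predicted classes and columns hold actual classes; the three-class matrix is invented, and `None` plays the role of the “undef” entries. The last line checks the harmonic-mean relation against the precision and recall reported for the Statement class under cheap POS tags in Figure C.35.

```python
def per_class_metrics(matrix, labels):
    """matrix[i][j]: posts predicted as labels[i] whose actual class is labels[j]."""
    metrics = {}
    for i, label in enumerate(labels):
        true_pos = matrix[i][i]
        predicted = sum(matrix[i])                    # row total
        actual = sum(row[i] for row in matrix)        # column total
        precision = true_pos / predicted if predicted else None
        recall = true_pos / actual if actual else None
        if precision is None or recall is None or precision + recall == 0:
            f_score = None                            # undefined
        else:
            f_score = 2 * precision * recall / (precision + recall)
        metrics[label] = (precision, recall, f_score)
    return metrics


def accuracy(matrix):
    """Fraction of all posts that lie on the diagonal."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


# Invented three-class example.
labels = ["Statement", "Greet", "Bye"]
toy = [
    [8, 1, 1],
    [2, 6, 0],
    [0, 0, 2],
]
m = per_class_metrics(toy, labels)
print(m["Greet"], accuracy(toy))

# F-score as the harmonic mean of a reported precision and recall.
f_statement = round(2 * 0.7345 * 0.8553 / (0.7345 + 0.8553), 4)
print(f_statement)
```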

Figures C.1 through C.10 show the results of corresponding experiment runs with emoticons
unrecognized (tagged as “UNK”).

Figure C.1: Experiment Run 5: Emoticons Unrecognized
Figure C.4: Experiment Run 20: Emoticons Unrecognized
Figure C.5: Experiment Run 25: Emoticons Unrecognized
Figure C.7: Experiment Run 35: Emoticons Unrecognized
Figure C.8: Experiment Run 40: Emoticons Unrecognized
Figure C.9: Experiment Run 45: Emoticons Unrecognized
Figure C.10: Experiment Run 50: Emoticons Unrecognized


Figures C.11 through C.20 show the results of corresponding experiment runs with emoticons
tagged as “UH” per Forsyth’s methodology.

Figure C.11: Experiment Run 5: Emoticons Assigned “UH” Tag
Figure C.12: Experiment Run 10: Emoticons Assigned “UH” Tag
Figure C.13: Experiment Run 15: Emoticons Assigned “UH” Tag
Figure C.14: Experiment Run 20: Emoticons Assigned “UH” Tag
Figure C.15: Experiment Run 25: Emoticons Assigned “UH” Tag
Figure C.16: Experiment Run 30: Emoticons Assigned “UH” Tag
Figure C.17: Experiment Run 35: Emoticons Assigned “UH” Tag
Figure C.18: Experiment Run 40: Emoticons Assigned “UH” Tag
Figure C.19: Experiment Run 45: Emoticons Assigned “UH” Tag
Figure C.20: Experiment Run 50: Emoticons Assigned “UH” Tag


Figures  C.21  through  C.30  show  the  results  of  corresponding  experiment  runs  with  emoticons 
tagged as “EMO”.

Figure C.21: Experiment Run 5: Emoticons Assigned “EMO” Tag
Figure C.22: Experiment Run 10: Emoticons Assigned “EMO” Tag
Figure C.23: Experiment Run 15: Emoticons Assigned “EMO” Tag
Figure C.24: Experiment Run 20: Emoticons Assigned “EMO” Tag
Figure C.25: Experiment Run 25: Emoticons Assigned “EMO” Tag
Figure C.26: Experiment Run 30: Emoticons Assigned “EMO” Tag
Figure C.27: Experiment Run 35: Emoticons Assigned “EMO” Tag
Figure C.28: Experiment Run 40: Emoticons Assigned “EMO” Tag
Figure C.29: Experiment Run 45: Emoticons Assigned “EMO” Tag
Figure C.30: Experiment Run 50: Emoticons Assigned “EMO” Tag


Figures  C.31  through  C.40  show  the  results  of  corresponding  experiment  runs  with  emoticons 
segregated into two categories based on type.

Figure C.31: Experiment Run 5: Emoticons Assigned “EMO” or “EMO2” Tags
Figure C.32: Experiment Run 10: Emoticons Assigned “EMO” or “EMO2” Tags
Figure C.33: Experiment Run 15: Emoticons Assigned “EMO” or “EMO2” Tags
Figure C.34: Experiment Run 20: Emoticons Assigned “EMO” or “EMO2” Tags


Using Actual POS tags

Class        Stmt   Sys Greet  Emot   ynQ   whQ   Acc  Emph   Bye  Cont   Rej  yAns  nAns Other  Clar  Precision  Recall  F-score
Statement     250     4     6    18     8     1     7     9     6     8    11     0     4     0     4     0.7440  0.8224   0.7813
System          3   275     0     0     0     0     0     0     0     0     0     0     0     0     0     0.9892  0.9857   0.9874
Greet          17     0   116     1     1     0     0     0     3     1     0     0     0     0     0     0.8345  0.9063   0.8689
Emotion         6     0     0   106     0     0     2     0     0     0     0     0     0     0     0     0.9298  0.8346   0.8797
ynQuestion      7     0     2     0    38     2     0     0     0     0     0     0     0     0     0     0.7755  0.7308   0.7525
whQuestion      4     0     1     0     4    39     0     1     0     0     0     0     0     0     0     0.7959  0.9070   0.8478
Accept          7     0     0     1     0     0     5     0     0     0     0     3     0     0     0     0.3125  0.2632   0.2857
Emphasis        3     0     2     1     0     0     1     7     1     0     0     0     0     0     0     0.4667  0.3889   0.4242
Bye             2     0     0     0     0     0     0     0    14     0     0     0     0     0     0     0.8750  0.5600   0.6829
Continuer       2     0     0     0     0     0     0     1     0     4     0     1     0     0     0     0.5000  0.2857   0.3636
Reject          2     0     0     0     0     1     0     0     0     0     4     0     2     0     0     0.4444  0.2667   0.3333
yAnswer         0     0     1     0     1     0     4     0     0     0     0     3     0     0     0     0.3333  0.4286   0.3750
nAnswer         0     0     0     0     0     0     0     0     0     1     0     0     2     1     0     0.5000  0.2500   0.3333
Other           0     0     0     0     0     0     0     0     1     0     0     0     0     0     0     0.0000  0.0000    undef
Clarify         1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0.0000  0.0000    undef

Overall accuracy: 82.66%

Using Cheap POS tags

Class        Stmt   Sys Greet  Emot   ynQ   whQ   Acc  Emph   Bye  Cont   Rej  yAns  nAns Other  Clar  Precision  Recall  F-score
Statement     260     2     7    21     9     1     7    11     7     9     9     2     5     0     4     0.7345  0.8553   0.7903
System          2   276     0     0     0     0     0     0     0     0     0     0     0     0     0     0.9928  0.9892   0.9910
Greet          16     0   116     0     1     1     0     0     2     1     0     0     0     0     0     0.8467  0.9063   0.8755
Emotion         3     0     0   105     0     0     2     0     1     0     0     0     0     0     0     0.9459  0.8268   0.8824
ynQuestion      9     0     2     0    37     2     0     0     0     0     0     0     0     0     0     0.7400  0.7115   0.7255
whQuestion      2     0     1     0     3    38     0     1     0     0     0     0     0     0     0     0.8444  0.8837   0.8636
Accept          6     0     0     0     1     0     5     0     0     0     0     4     0     0     0     0.3125  0.2632   0.2857
Emphasis        4     0     2     1     0     0     1     6     0     0     0     0     0     0     0     0.4286  0.3333   0.3750
Bye             0     0     0     0     0     0     0     0    15     0     0     0     0     0     0     1.0000  0.6000   0.7500
Continuer       1     0     0     0     0     0     0     0     0     4     0     0     0     0     0     0.8000  0.2857   0.4211
Reject          0     0     0     0     0     1     0     0     0     0     4     0     1     0     0     0.6667  0.2667   0.3810
yAnswer         0     0     0     0     1     0     4     0     0     0     0     1     0     0     0     0.1667  0.1429   0.1538
nAnswer         0     1     0     0     0     0     0     0     0     0     2     0     2     1     0     0.3333  0.2500   0.2857
Other           0     0     0     0     0     0     0     0     0     0     0     0     0     0     0      undef  0.0000    undef
Clarify         1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0.0000  0.0000    undef

Overall accuracy: 83.24%

Columns list the dialog act classes in the same order as the rows. Precision is the diagonal
count divided by the row total; recall is the diagonal count divided by the column total.

Figure C.35: Experiment Run 25: Emoticons Assigned “EMO” or “EMO2” Tags
Figure C.36: Experiment Run 30: Emoticons Assigned “EMO” or “EMO2” Tags
Figure C.37: Experiment Run 35: Emoticons Assigned “EMO” or “EMO2” Tags
Figure C.38: Experiment Run 40: Emoticons Assigned “EMO” or “EMO2” Tags
Figure C.39: Experiment Run 45: Emoticons Assigned “EMO” or “EMO2” Tags
Figure C.40: Experiment Run 50: Emoticons Assigned “EMO” or “EMO2” Tags